[jira] [Created] (HUDI-6586) Add Incremental scan support to dbt
Vinoth Govindarajan created HUDI-6586: - Summary: Add Incremental scan support to dbt Key: HUDI-6586 URL: https://issues.apache.org/jira/browse/HUDI-6586 Project: Apache Hudi Issue Type: Epic Components: connectors Reporter: Vinoth Govindarajan Assignee: Vinoth Govindarajan Fix For: 1.0.0 The current dbt support adds only the basic Hudi primitives, but with deeper integration we could enable faster ETL queries using the incremental read primitive, similar to the DeltaStreamer support. The goal of this epic is to enable incremental data processing for dbt. -- This message was sent by Atlassian Jira (v8.20.10#820010)
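For illustration, an incremental dbt model over a Hudi source might look like the sketch below. All model, source, and column names are invented, and the options follow dbt-spark conventions (which support `file_format='hudi'`); the comment about an incremental scan describes the goal of this epic, not an existing API:

```sql
-- Hypothetical incremental dbt model over a Hudi source table.
-- All names (raw_events, event_ts, etc.) are illustrative.
{{ config(
    materialized='incremental',
    file_format='hudi',
    unique_key='event_id'
) }}

select event_id, payload, event_ts
from {{ source('raw', 'raw_events') }}
{% if is_incremental() %}
  -- with incremental-read support, this timestamp filter could be
  -- replaced by a Hudi incremental scan that reads only the commits
  -- written since the last dbt run
  where event_ts > (select max(event_ts) from {{ this }})
{% endif %}
```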
[jira] [Created] (HUDI-4254) Refactor SnowflakeSyncTool and BigQuerySyncTool
Vinoth Govindarajan created HUDI-4254: - Summary: Refactor SnowflakeSyncTool and BigQuerySyncTool Key: HUDI-4254 URL: https://issues.apache.org/jira/browse/HUDI-4254 Project: Apache Hudi Issue Type: Task Reporter: Vinoth Govindarajan Assignee: Vinoth Govindarajan There are many similarities between SnowflakeSyncTool and BigQuerySyncTool; refactor the common methods into an abstract class and reuse it in both tools to avoid code duplication. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (HUDI-4220) Create a utility to convert json records to avro
Vinoth Govindarajan created HUDI-4220: - Summary: Create a utility to convert json records to avro Key: HUDI-4220 URL: https://issues.apache.org/jira/browse/HUDI-4220 Project: Apache Hudi Issue Type: Task Components: spark Reporter: Vinoth Govindarajan Assignee: Vinoth Govindarajan We need a utility to convert topics consisting of many gzipped files of JSON lines into Avro. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (HUDI-2832) [Umbrella] [RFC-40] Integrated Hudi with Snowflake
[ https://issues.apache.org/jira/browse/HUDI-2832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan updated HUDI-2832: -- Summary: [Umbrella] [RFC-40] Integrated Hudi with Snowflake (was: [Umbrella] [RFC-40] Implement SnowflakeSyncTool to support Hudi to Snowflake Integration) > [Umbrella] [RFC-40] Integrated Hudi with Snowflake > --- > > Key: HUDI-2832 > URL: https://issues.apache.org/jira/browse/HUDI-2832 > Project: Apache Hudi > Issue Type: Epic > Components: Common Core >Reporter: Vinoth Govindarajan >Assignee: Vinoth Govindarajan >Priority: Blocker > Labels: BigQuery, Integration, pull-request-available > Fix For: 0.12.0 > > > Snowflake is a fully managed service that’s simple to use but can power a > near-unlimited number of concurrent workloads. Snowflake is a solution for > data warehousing, data lakes, data engineering, data science, data > application development, and securely sharing and consuming shared data. > Snowflake [doesn’t > support|https://docs.snowflake.com/en/sql-reference/sql/alter-file-format.html] > the Apache Hudi file format yet, but it has support for the Parquet, ORC, and Delta > file formats. This proposal is to implement a SnowflakeSyncTool, similar to > HiveSyncTool, to sync a Hudi table as a Snowflake external Parquet table so > that users can query Hudi tables using Snowflake. Many users have > expressed interest through Hudi and other support channels in integrating > Hudi with Snowflake; this will unlock new use cases for Hudi. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (HUDI-4137) Implement SnowflakeSyncTool to support Hudi to Snowflake Integration
Vinoth Govindarajan created HUDI-4137: - Summary: Implement SnowflakeSyncTool to support Hudi to Snowflake Integration Key: HUDI-4137 URL: https://issues.apache.org/jira/browse/HUDI-4137 Project: Apache Hudi Issue Type: Task Reporter: Vinoth Govindarajan Assignee: Vinoth Govindarajan Fix For: 0.12.0 Implement SnowflakeSyncTool similar to the BigQuerySyncTool to support Hudi to Snowflake Integration -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (HUDI-4000) Docs around DBT
[ https://issues.apache.org/jira/browse/HUDI-4000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan updated HUDI-4000: -- Status: Patch Available (was: In Progress) > Docs around DBT > --- > > Key: HUDI-4000 > URL: https://issues.apache.org/jira/browse/HUDI-4000 > Project: Apache Hudi > Issue Type: Task > Components: docs >Reporter: Ethan Guo >Assignee: Vinoth Govindarajan >Priority: Minor > Labels: pull-request-available > Fix For: 0.11.1 > > > [https://github.com/apache/hudi/issues/5367] > Do we have any step by step document on how to use Hudi in conjunction with > DBT. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (HUDI-4000) Docs around DBT
[ https://issues.apache.org/jira/browse/HUDI-4000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan updated HUDI-4000: -- Status: In Progress (was: Open) > Docs around DBT > --- > > Key: HUDI-4000 > URL: https://issues.apache.org/jira/browse/HUDI-4000 > Project: Apache Hudi > Issue Type: Task > Components: docs >Reporter: Ethan Guo >Assignee: Vinoth Govindarajan >Priority: Minor > Labels: pull-request-available > Fix For: 0.11.1 > > > [https://github.com/apache/hudi/issues/5367] > Do we have any step by step document on how to use Hudi in conjunction with > DBT. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Closed] (HUDI-1743) Add support for Spark SQL File based transformer for deltastreamer
[ https://issues.apache.org/jira/browse/HUDI-1743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan closed HUDI-1743. - Resolution: Fixed > Add support for Spark SQL File based transformer for deltastreamer > -- > > Key: HUDI-1743 > URL: https://issues.apache.org/jira/browse/HUDI-1743 > Project: Apache Hudi > Issue Type: Improvement > Components: deltastreamer >Reporter: Vinoth Govindarajan >Assignee: Vinoth Govindarajan >Priority: Minor > Labels: features, pull-request-available, sev:normal > > The current SQLQuery-based transformer is limited in functionality: you can't > pass multiple Spark SQL statements separated by semicolons, which is > necessary if your transformation is complex. > > The ask is to add a new SQLFileBasedTransformer which takes a Spark SQL file > as input with multiple Spark SQL statements and applies the transformations to > the delta streamer payload. -- This message was sent by Atlassian Jira (v8.20.7#820007)
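A minimal sketch of the statement-splitting step such a transformer needs, assuming a naive semicolon delimiter (a real implementation would also have to cope with semicolons inside string literals and comments); the class and method names here are hypothetical, not the actual SQLFileBasedTransformer internals:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: split the contents of a Spark SQL file into
// individual statements so each one can be passed to spark.sql(...) in
// order. Naive split on ';' -- does not handle semicolons embedded in
// string literals or comments.
public class SqlFileSplitter {
  public static List<String> splitStatements(String sqlFileContents) {
    List<String> statements = new ArrayList<>();
    for (String fragment : sqlFileContents.split(";")) {
      String stmt = fragment.trim();
      if (!stmt.isEmpty()) {
        statements.add(stmt);
      }
    }
    return statements;
  }
}
```

The transformer would then iterate over the list, running every statement and returning the result of the final one as the payload.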
[jira] [Reopened] (HUDI-1743) Add support for Spark SQL File based transformer for deltastreamer
[ https://issues.apache.org/jira/browse/HUDI-1743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan reopened HUDI-1743: --- > Add support for Spark SQL File based transformer for deltastreamer > -- > > Key: HUDI-1743 > URL: https://issues.apache.org/jira/browse/HUDI-1743 > Project: Apache Hudi > Issue Type: Improvement > Components: deltastreamer >Reporter: Vinoth Govindarajan >Assignee: Vinoth Govindarajan >Priority: Minor > Labels: features, pull-request-available, sev:normal > > The current SQLQuery-based transformer is limited in functionality: you can't > pass multiple Spark SQL statements separated by semicolons, which is > necessary if your transformation is complex. > > The ask is to add a new SQLFileBasedTransformer which takes a Spark SQL file > as input with multiple Spark SQL statements and applies the transformations to > the delta streamer payload. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Resolved] (HUDI-1743) Add support for Spark SQL File based transformer for deltastreamer
[ https://issues.apache.org/jira/browse/HUDI-1743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan resolved HUDI-1743. --- > Add support for Spark SQL File based transformer for deltastreamer > -- > > Key: HUDI-1743 > URL: https://issues.apache.org/jira/browse/HUDI-1743 > Project: Apache Hudi > Issue Type: Improvement > Components: deltastreamer >Reporter: Vinoth Govindarajan >Assignee: Vinoth Govindarajan >Priority: Minor > Labels: features, pull-request-available, sev:normal > > The current SQLQuery-based transformer is limited in functionality: you can't > pass multiple Spark SQL statements separated by semicolons, which is > necessary if your transformation is complex. > > The ask is to add a new SQLFileBasedTransformer which takes a Spark SQL file > as input with multiple Spark SQL statements and applies the transformations to > the delta streamer payload. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Closed] (HUDI-1790) Add SqlSource for DeltaStreamer to support backfill use cases
[ https://issues.apache.org/jira/browse/HUDI-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan closed HUDI-1790. - Resolution: Fixed > Add SqlSource for DeltaStreamer to support backfill use cases > - > > Key: HUDI-1790 > URL: https://issues.apache.org/jira/browse/HUDI-1790 > Project: Apache Hudi > Issue Type: New Feature > Components: deltastreamer >Reporter: Vinoth Govindarajan >Assignee: Vinoth Govindarajan >Priority: Major > Labels: pull-request-available, sev:normal > > DeltaStreamer is great for incremental workloads, but we need to support > backfills for use cases such as adding a new column and backfilling only that > column for the last 6 months, or reprocessing a couple of older partitions > after a bug in our transformation logic. > > If we have a SqlSource as one of the input sources to the DeltaStreamer, then > I can pass any custom Spark SQL query selecting specific partitions and > backfill them. > > When we do the backfill, we don't need to update the last-processed commit > checkpoint; instead, the last-processed checkpoint from before the backfill > has to be carried over to the backfill commit. > > cc [~nishith29] -- This message was sent by Atlassian Jira (v8.20.7#820007)
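For illustration, the kind of query such a SqlSource could accept might look like the following; the database, table, column, and partition names are invented:

```sql
-- Hypothetical backfill query for a SqlSource-style input: select only
-- the partitions that need reprocessing, while the commit checkpoint
-- from before the backfill is carried over unchanged.
SELECT *
FROM source_db.events
WHERE datestr BETWEEN '2021-01-01' AND '2021-06-30'
```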
[jira] [Resolved] (HUDI-1762) Hive Sync is not working with Hive Style Partitioning
[ https://issues.apache.org/jira/browse/HUDI-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan resolved HUDI-1762. --- > Hive Sync is not working with Hive Style Partitioning > - > > Key: HUDI-1762 > URL: https://issues.apache.org/jira/browse/HUDI-1762 > Project: Apache Hudi > Issue Type: Bug > Components: hive >Reporter: Vinoth Govindarajan >Assignee: Vinoth Govindarajan >Priority: Major > Labels: hive, pull-request-available > > When you create a Hudi table with hive-style partitioning and enable hive > sync, the sync does not work, because it assumes the partition values are > separated by slashes. > > When hive-style partitioning is enabled for the target table like this: > {code:java} > hoodie.datasource.write.partitionpath.field=datestr > hoodie.datasource.write.hive_style_partitioning=true > {code} > This is the error it throws: > {code:java} > 21/04/01 23:10:33 ERROR deltastreamer.HoodieDeltaStreamer: Got error running > delta sync once.
Shutting down > org.apache.hudi.exception.HoodieException: Got runtime exception when hive > syncing delta_streamer_test > at > org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:122) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.syncMeta(DeltaSync.java:560) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:475) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:282) > at > org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:170) > at org.apache.hudi.common.util.Option.ifPresent(Option.java:96) > at > org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:168) > at > org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:470) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:690) > Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Failed to sync > partitions for table fact_scheduled_trip__1pc_trip_uuid > at > org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:229) > at > org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:166) > at > org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:108) > ... 
12 more > Caused by: java.lang.IllegalArgumentException: Partition path > datestr=2021-03-28 is not in the form yyyy/mm/dd > at > org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor.extractPartitionValuesInPath(SlashEncodedDayPartitionValueExtractor.java:55) > at > org.apache.hudi.hive.HoodieHiveClient.getPartitionEvents(HoodieHiveClient.java:220) > at > org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:221) > ... 14 more > {code} > To fix this issue, we need to create a new partition extractor class and > assign that class name as the hive sync partition extractor. > After you define the new partition extractor class, you can configure it like > this: > {code:java} > hoodie.datasource.hive_sync.enable=true > hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor > {code} -- This message was sent by Atlassian Jira (v8.20.7#820007)
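The extraction logic the fix calls for can be sketched as follows. This is an illustrative standalone version, not the actual HiveStylePartitionValueExtractor class; the class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of hive-style partition value extraction: a path
// segment like "datestr=2021-03-28" carries its value after '=', unlike
// the slash-encoded "2021/03/28" layout the default day extractor expects.
public class HiveStylePartitionParser {
  public static List<String> extractPartitionValues(String partitionPath) {
    List<String> values = new ArrayList<>();
    for (String segment : partitionPath.split("/")) {
      int idx = segment.indexOf('=');
      if (idx < 0) {
        throw new IllegalArgumentException(
            "Partition segment " + segment + " is not in key=value form");
      }
      // keep only the value; the key is the partition column name
      values.add(segment.substring(idx + 1));
    }
    return values;
  }
}
```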
[jira] [Closed] (HUDI-1762) Hive Sync is not working with Hive Style Partitioning
[ https://issues.apache.org/jira/browse/HUDI-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan closed HUDI-1762. - Resolution: Fixed > Hive Sync is not working with Hive Style Partitioning > - > > Key: HUDI-1762 > URL: https://issues.apache.org/jira/browse/HUDI-1762 > Project: Apache Hudi > Issue Type: Bug > Components: hive >Reporter: Vinoth Govindarajan >Assignee: Vinoth Govindarajan >Priority: Major > Labels: hive, pull-request-available > > When you create a Hudi table with hive-style partitioning and enable hive > sync, the sync does not work, because it assumes the partition values are > separated by slashes. > > When hive-style partitioning is enabled for the target table like this: > {code:java} > hoodie.datasource.write.partitionpath.field=datestr > hoodie.datasource.write.hive_style_partitioning=true > {code} > This is the error it throws: > {code:java} > 21/04/01 23:10:33 ERROR deltastreamer.HoodieDeltaStreamer: Got error running > delta sync once.
Shutting down > org.apache.hudi.exception.HoodieException: Got runtime exception when hive > syncing delta_streamer_test > at > org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:122) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.syncMeta(DeltaSync.java:560) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:475) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:282) > at > org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:170) > at org.apache.hudi.common.util.Option.ifPresent(Option.java:96) > at > org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:168) > at > org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:470) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:690) > Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Failed to sync > partitions for table fact_scheduled_trip__1pc_trip_uuid > at > org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:229) > at > org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:166) > at > org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:108) > ... 
12 more > Caused by: java.lang.IllegalArgumentException: Partition path > datestr=2021-03-28 is not in the form yyyy/mm/dd > at > org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor.extractPartitionValuesInPath(SlashEncodedDayPartitionValueExtractor.java:55) > at > org.apache.hudi.hive.HoodieHiveClient.getPartitionEvents(HoodieHiveClient.java:220) > at > org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:221) > ... 14 more > {code} > To fix this issue, we need to create a new partition extractor class and > assign that class name as the hive sync partition extractor. > After you define the new partition extractor class, you can configure it like > this: > {code:java} > hoodie.datasource.hive_sync.enable=true > hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor > {code} -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Reopened] (HUDI-1762) Hive Sync is not working with Hive Style Partitioning
[ https://issues.apache.org/jira/browse/HUDI-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan reopened HUDI-1762: --- > Hive Sync is not working with Hive Style Partitioning > - > > Key: HUDI-1762 > URL: https://issues.apache.org/jira/browse/HUDI-1762 > Project: Apache Hudi > Issue Type: Bug > Components: hive >Reporter: Vinoth Govindarajan >Assignee: Vinoth Govindarajan >Priority: Major > Labels: hive, pull-request-available > > When you create a Hudi table with hive-style partitioning and enable hive > sync, the sync does not work, because it assumes the partition values are > separated by slashes. > > When hive-style partitioning is enabled for the target table like this: > {code:java} > hoodie.datasource.write.partitionpath.field=datestr > hoodie.datasource.write.hive_style_partitioning=true > {code} > This is the error it throws: > {code:java} > 21/04/01 23:10:33 ERROR deltastreamer.HoodieDeltaStreamer: Got error running > delta sync once.
Shutting down > org.apache.hudi.exception.HoodieException: Got runtime exception when hive > syncing delta_streamer_test > at > org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:122) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.syncMeta(DeltaSync.java:560) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:475) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:282) > at > org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:170) > at org.apache.hudi.common.util.Option.ifPresent(Option.java:96) > at > org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:168) > at > org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:470) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:690) > Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Failed to sync > partitions for table fact_scheduled_trip__1pc_trip_uuid > at > org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:229) > at > org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:166) > at > org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:108) > ... 
12 more > Caused by: java.lang.IllegalArgumentException: Partition path > datestr=2021-03-28 is not in the form yyyy/mm/dd > at > org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor.extractPartitionValuesInPath(SlashEncodedDayPartitionValueExtractor.java:55) > at > org.apache.hudi.hive.HoodieHiveClient.getPartitionEvents(HoodieHiveClient.java:220) > at > org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:221) > ... 14 more > {code} > To fix this issue, we need to create a new partition extractor class and > assign that class name as the hive sync partition extractor. > After you define the new partition extractor class, you can configure it like > this: > {code:java} > hoodie.datasource.hive_sync.enable=true > hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor > {code} -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Reopened] (HUDI-1790) Add SqlSource for DeltaStreamer to support backfill use cases
[ https://issues.apache.org/jira/browse/HUDI-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan reopened HUDI-1790: --- > Add SqlSource for DeltaStreamer to support backfill use cases > - > > Key: HUDI-1790 > URL: https://issues.apache.org/jira/browse/HUDI-1790 > Project: Apache Hudi > Issue Type: New Feature > Components: deltastreamer >Reporter: Vinoth Govindarajan >Assignee: Vinoth Govindarajan >Priority: Major > Labels: pull-request-available, sev:normal > > DeltaStreamer is great for incremental workloads, but we need to support > backfills for use cases such as adding a new column and backfilling only that > column for the last 6 months, or reprocessing a couple of older partitions > after a bug in our transformation logic. > > If we have a SqlSource as one of the input sources to the DeltaStreamer, then > I can pass any custom Spark SQL query selecting specific partitions and > backfill them. > > When we do the backfill, we don't need to update the last-processed commit > checkpoint; instead, the last-processed checkpoint from before the backfill > has to be carried over to the backfill commit. > > cc [~nishith29] -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Resolved] (HUDI-1790) Add SqlSource for DeltaStreamer to support backfill use cases
[ https://issues.apache.org/jira/browse/HUDI-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan resolved HUDI-1790. --- > Add SqlSource for DeltaStreamer to support backfill use cases > - > > Key: HUDI-1790 > URL: https://issues.apache.org/jira/browse/HUDI-1790 > Project: Apache Hudi > Issue Type: New Feature > Components: deltastreamer >Reporter: Vinoth Govindarajan >Assignee: Vinoth Govindarajan >Priority: Major > Labels: pull-request-available, sev:normal > > DeltaStreamer is great for incremental workloads, but we need to support > backfills for use cases such as adding a new column and backfilling only that > column for the last 6 months, or reprocessing a couple of older partitions > after a bug in our transformation logic. > > If we have a SqlSource as one of the input sources to the DeltaStreamer, then > I can pass any custom Spark SQL query selecting specific partitions and > backfill them. > > When we do the backfill, we don't need to update the last-processed commit > checkpoint; instead, the last-processed checkpoint from before the backfill > has to be carried over to the backfill commit. > > cc [~nishith29] -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Closed] (HUDI-2319) Integrate hudi with dbt (data build tool)
[ https://issues.apache.org/jira/browse/HUDI-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan closed HUDI-2319. - Resolution: Fixed > Integrate hudi with dbt (data build tool) > - > > Key: HUDI-2319 > URL: https://issues.apache.org/jira/browse/HUDI-2319 > Project: Apache Hudi > Issue Type: New Feature > Components: Usability >Reporter: Vinoth Govindarajan >Assignee: Vinoth Govindarajan >Priority: Critical > Labels: integration, pull-request-available > Fix For: 0.10.0 > > > dbt (data build tool) enables analytics engineers to transform data in their > warehouses by simply writing select statements. dbt handles turning these > select statements into tables and views. > dbt does the {{T}} in {{ELT}} (Extract, Load, Transform) processes – it > doesn’t extract or load data, but it’s extremely good at transforming data > that’s already loaded into your warehouse. > > dbt currently supports only the Delta file format; a few folks in the dbt > community are asking for Hudi integration. Moreover, adding dbt integration > will make it easier for any data engineer/analyst to create derived datasets. > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (HUDI-3838) Make Drop partition column config work with deltastreamer
[ https://issues.apache.org/jira/browse/HUDI-3838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan resolved HUDI-3838. --- > Make Drop partition column config work with deltastreamer > - > > Key: HUDI-3838 > URL: https://issues.apache.org/jira/browse/HUDI-3838 > Project: Apache Hudi > Issue Type: Improvement > Components: meta-sync >Reporter: Raymond Xu >Assignee: Vinoth Govindarajan >Priority: Blocker > Labels: pull-request-available > Fix For: 0.11.0 > > > hoodie.datasource.write.drop.partition.columns only works for the datasource > writer; HoodieDeltaStreamer is not using it. We need it for the deltastreamer -> > BigQuery sync flow. -- This message was sent by Atlassian Jira (v8.20.1#820001)
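For context, the config in question would be set as a write-options fragment like the one below. The first property is the one named in this ticket; pairing it with hive-style partitioning is an illustrative assumption about the BigQuery sync flow, not prescribed by the ticket:

```
# Illustrative hoodie properties fragment: drop partition columns from
# the written data files; a query engine such as BigQuery can then infer
# them from the hive-style partition paths instead.
hoodie.datasource.write.drop.partition.columns=true
hoodie.datasource.write.hive_style_partitioning=true
```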
[jira] [Commented] (HUDI-3838) Make Drop partition column config work with deltastreamer
[ https://issues.apache.org/jira/browse/HUDI-3838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17521370#comment-17521370 ] Vinoth Govindarajan commented on HUDI-3838: --- Added the support; see the linked PR for more details. > Make Drop partition column config work with deltastreamer > - > > Key: HUDI-3838 > URL: https://issues.apache.org/jira/browse/HUDI-3838 > Project: Apache Hudi > Issue Type: Improvement > Components: meta-sync >Reporter: Raymond Xu >Assignee: Vinoth Govindarajan >Priority: Blocker > Labels: pull-request-available > Fix For: 0.11.0 > > > hoodie.datasource.write.drop.partition.columns only works for the datasource > writer; HoodieDeltaStreamer is not using it. We need it for the deltastreamer -> > BigQuery sync flow. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Closed] (HUDI-3838) Make Drop partition column config work with deltastreamer
[ https://issues.apache.org/jira/browse/HUDI-3838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan closed HUDI-3838. - Resolution: Fixed > Make Drop partition column config work with deltastreamer > - > > Key: HUDI-3838 > URL: https://issues.apache.org/jira/browse/HUDI-3838 > Project: Apache Hudi > Issue Type: Improvement > Components: meta-sync >Reporter: Raymond Xu >Assignee: Vinoth Govindarajan >Priority: Blocker > Labels: pull-request-available > Fix For: 0.11.0 > > > hoodie.datasource.write.drop.partition.columns only works for the datasource > writer; HoodieDeltaStreamer is not using it. We need it for the deltastreamer -> > BigQuery sync flow. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-3838) Make Drop partition column config work with deltastreamer
[ https://issues.apache.org/jira/browse/HUDI-3838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan updated HUDI-3838: -- Status: Patch Available (was: In Progress) > Make Drop partition column config work with deltastreamer > - > > Key: HUDI-3838 > URL: https://issues.apache.org/jira/browse/HUDI-3838 > Project: Apache Hudi > Issue Type: Improvement > Components: meta-sync >Reporter: Raymond Xu >Assignee: Vinoth Govindarajan >Priority: Blocker > Labels: pull-request-available > Fix For: 0.11.0 > > > hoodie.datasource.write.drop.partition.columns only works for the datasource > writer; HoodieDeltaStreamer is not using it. We need it for the deltastreamer -> > BigQuery sync flow. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-2438) [Umbrella] [RFC-34] Implement BigQuerySyncTool for BigQuery Sync
[ https://issues.apache.org/jira/browse/HUDI-2438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan updated HUDI-2438: -- Status: In Progress (was: Open) > [Umbrella] [RFC-34] Implement BigQuerySyncTool for BigQuery Sync > > > Key: HUDI-2438 > URL: https://issues.apache.org/jira/browse/HUDI-2438 > Project: Apache Hudi > Issue Type: Epic > Components: Common Core, meta-sync >Reporter: Vinoth Govindarajan >Assignee: Vinoth Govindarajan >Priority: Blocker > Labels: BigQuery, Integration, pull-request-available > Fix For: 0.11.0 > > > BigQuery is Google Cloud's fully managed, petabyte-scale, and cost-effective > analytics data warehouse that lets you run analytics over vast amounts of > data in near real-time. BigQuery currently [doesn’t > support|https://cloud.google.com/bigquery/external-data-cloud-storage] the Apache > Hudi file format, but it has support for the Parquet file format. The > proposal is to implement a BigQuerySyncTool, similar to HiveSyncTool, to sync a > Hudi table as a BigQuery external Parquet table so that users can query Hudi > tables using BigQuery. Uber is already syncing some of its Hudi tables to a > BigQuery data mart; this will help them to write, sync, and query. > > More details are in RFC-34: > [https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=188745980] -- This message was sent by Atlassian Jira (v8.20.1#820001)
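What the sync produces can be pictured as a BigQuery external table over the Hudi table's Parquet files. The DDL below is an illustrative sketch using standard BigQuery syntax; the dataset, table, and bucket names are invented:

```sql
-- Illustrative BigQuery DDL for the kind of external table a sync tool
-- would maintain over a Hudi table's Parquet files (all names hypothetical).
CREATE EXTERNAL TABLE my_dataset.hudi_events
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-bucket/hudi_events/*.parquet']
);
```

In practice the sync has to keep the file list current as Hudi commits add and replace Parquet files, which is why manifest generation (see the comments below) matters.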
[jira] [Updated] (HUDI-3371) Add Dremio integration support to hudi
[ https://issues.apache.org/jira/browse/HUDI-3371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan updated HUDI-3371: -- Issue Type: New Feature (was: Task) > Add Dremio integration support to hudi > -- > > Key: HUDI-3371 > URL: https://issues.apache.org/jira/browse/HUDI-3371 > Project: Apache Hudi > Issue Type: New Feature >Reporter: sivabalan narayanan >Assignee: Vinoth Govindarajan >Priority: Major > > For BI reports and dashboards, Dremio gives accelerated queries, so adding > Dremio integration with Hudi will be an added value to Hudi users in general. > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (HUDI-2438) [Umbrella] [RFC-34] Implement BigQuerySyncTool for BigQuery Sync
[ https://issues.apache.org/jira/browse/HUDI-2438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17516608#comment-17516608 ] Vinoth Govindarajan edited comment on HUDI-2438 at 4/4/22 3:07 AM: --- Hi [~gauravrai0x] & [~l0s01w3], There is a way to generate manifest files; [~joyansil] implemented a Java client for this, and the details are in this ticket: https://issues.apache.org/jira/browse/HUDI-3020 BigQuerySyncTool is also available and will be part of the 0.11.0 release; it already invokes the manifest generation code as part of the sync method. Here is how you can test it out: {code:java} import org.apache.hadoop.conf.Configuration; import org.apache.hudi.metadata.ManifestFileUtil; val conf = new Configuration();{code} {code:java} val basePath = "gs://udp-hudi-storage5/store_visit_scan_bootstrap_hudi" val manifestFileUtil: ManifestFileUtil = ManifestFileUtil.builder().setConf(conf).setBasePath(basePath).build(); manifestFileUtil.writeManifestFile() {code} {code:java} gsutil cat gs://udp-hudi-storage5/store_visit_scan_bootstrap_hudi/.hoodie/manifest/latest-snapshot.csv | head -5{code} {code:java} gs://udp-hudi-storage5/store_visit_scan_bootstrap_hudi/op_cmpny_cd=WMT-US/visit_date=2020-11-01/95e78133-08dd-4721-9c49-8fbe338589f0-0_621-10-1021_20210927173651.parquet gs://udp-hudi-storage5/store_visit_scan_bootstrap_hudi/op_cmpny_cd=WMT-US/visit_date=2020-11-01/1c01ddf1-41e8-43bf-94f3-9a75c9e39b21-0_1238-12-1638_20210927173651.parquet gs://udp-hudi-storage5/store_visit_scan_bootstrap_hudi/op_cmpny_cd=WMT-US/visit_date=2020-11-01/a021864b-d0b6-462a-8f47-6370668993d6-0_694-12-1094_20210927173651.parquet gs://udp-hudi-storage5/store_visit_scan_bootstrap_hudi/op_cmpny_cd=WMT-US/visit_date=2020-11-01/11815af9-6ed5-4f05-9b6b-7f79479f4f53-0_625-12-1025_20210927173651.parquet gs://udp-hudi-storage5/store_visit_scan_bootstrap_hudi/op_cmpny_cd=WMT-US/visit_date=2020-11-01/ef4fc799-9124-4656-8106-f0374590dde6-0_1151-12-1551_20210927173651.parquet {code} was
(Author: vino): Hi [~gauravrai0x] & [~l0s01w3], There is a way to generate manifest files, [~joyansil] implemented a Java client for this, the details are in this ticket: https://issues.apache.org/jira/browse/HUDI-3020 BigQuerySyncTool is also available, it will be part of 0.11.0 release, which already invoke the manifest generation code as part of the sync method. Here is how you can test it out: {code:java} import org.apache.hadoop.conf.Configuration; import org.apache.hudi.metadata.ManifestFileUtilval conf = new Configuration();{code} {code:java} val basePath = "gs://udp-hudi-storage5/store_visit_scan_bootstrap_hudi" val manifestFileUtil: ManifestFileUtil = ManifestFileUtil.builder().setConf(conf).setBasePath(basePath).build(); manifestFileUtil.writeManifestFile() {code} > [Umbrella] [RFC-34] Implement BigQuerySyncTool for BigQuery Sync > > > Key: HUDI-2438 > URL: https://issues.apache.org/jira/browse/HUDI-2438 > Project: Apache Hudi > Issue Type: Epic > Components: Common Core, meta-sync >Reporter: Vinoth Govindarajan >Assignee: Vinoth Govindarajan >Priority: Blocker > Labels: BigQuery, Integration, pull-request-available > Fix For: 0.11.0 > > > BigQuery is Google Cloud's fully managed, petabyte-scale, and cost-effective > analytics data warehouse that lets you run analytics over vast amounts of > data in near real-time. BigQuery currently [doesn’t > support|https://cloud.google.com/bigquery/external-data-cloud-storage] Apache > Hudi file format, but it has support for the Parquet file format. The > proposal is to implement a BigQuerySync similar to HiveSync to sync the Hudi > table as the BigQuery External Parquet table so that users can query the Hudi > tables using BigQuery. Uber is already syncing some of its Hudi tables to > BigQuery data mart this will help them to write, sync, and query. > > More details are in RFC-34: > [https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=188745980] -- This message was sent by Atlassian Jira (v8.20.1#820001)
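For readers curious what the manifest selection boils down to, it can be sketched as follows. This is a simplified, hypothetical re-implementation for illustration only, not the actual ManifestFileUtil code: Hudi base file names embed a file group ID and a commit timestamp (`<fileId>_<writeToken>_<commitTime>.parquet`), so the latest snapshot keeps, for each file ID, the file with the highest commit time, and the manifest is just those winners written out one per line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simplified sketch (NOT the real ManifestFileUtil): pick the latest file slice
// per file group from Hudi-style base file names and emit the one-column CSV
// body that a manifest file would contain.
public class ManifestSketch {
    // File group ID is everything before the first underscore.
    static String fileId(String name) { return name.substring(0, name.indexOf('_')); }

    // Commit time is the segment after the last underscore, minus the extension.
    static String commitTime(String name) {
        String noExt = name.substring(0, name.lastIndexOf('.'));
        return noExt.substring(noExt.lastIndexOf('_') + 1);
    }

    static List<String> latestSnapshot(List<String> baseFiles) {
        Map<String, String> latest = new TreeMap<>();
        for (String f : baseFiles) {
            // Keep whichever file version has the lexicographically larger commit time.
            latest.merge(fileId(f), f, (a, b) ->
                commitTime(a).compareTo(commitTime(b)) >= 0 ? a : b);
        }
        return new ArrayList<>(latest.values());
    }

    public static void main(String[] args) {
        // Hypothetical file names: two versions of file group f1-0, one of f2-0.
        List<String> baseFiles = List.of(
            "f1-0_0-21-1605_20210101090000.parquet",
            "f1-0_0-53-3220_20210102090000.parquet",
            "f2-0_0-21-1605_20210101090000.parquet");
        // Prints the newer f1-0 file and the single f2-0 file.
        latestSnapshot(baseFiles).forEach(System.out::println);
    }
}
```

The file names, like everything else here, are illustrative; the real tool walks the Hudi timeline rather than parsing names.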
[jira] [Updated] (HUDI-3748) Hudi fails to insert into a partitioned table when the partition column is dropped from the parquet schema
[ https://issues.apache.org/jira/browse/HUDI-3748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan updated HUDI-3748: -- Description: When you add the config that drops the partition column from the parquet schema (to support BigQuery), Hudi fails to insert with the following error. Steps to reproduce: Start the Hudi docker demo:
{code:java}
cd hudi/docker
./setup_demo.sh
{code}
{code:java}
docker exec -it adhoc-2 /bin/bash
{code}
{code:java}
# Log into spark-sql and execute the following commands:
spark-sql --jars $HUDI_SPARK_BUNDLE \
  --master local[2] \
  --driver-class-path $HADOOP_CONF_DIR \
  --conf spark.sql.hive.convertMetastoreParquet=false \
  --deploy-mode client \
  --driver-memory 1G \
  --executor-memory 3G \
  --num-executors 1 \
  --packages org.apache.spark:spark-avro_2.11:2.4.4 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
{code}
{code:java}
create table bq_demo_partitioned_cow (
  id bigint,
  name string,
  price double,
  ts bigint,
  dt string
) using hudi
partitioned by (dt)
tblproperties (
  type = 'cow',
  primaryKey = 'id',
  preCombineField = 'ts',
  hoodie.datasource.write.drop.partition.columns = 'true'
);

insert into bq_demo_partitioned_cow partition(dt='2021-12-25') values(1, 'a1', 10, current_timestamp());
insert into bq_demo_partitioned_cow partition(dt='2021-12-25') values(2, 'a2', 20, current_timestamp());
{code}
Error:
{code:java}
22/03/29 20:58:02 INFO spark.SparkContext: Starting job: collect at HoodieSparkEngineContext.java:100 22/03/29 20:58:02 INFO scheduler.DAGScheduler: Got job 63 (collect at HoodieSparkEngineContext.java:100) with 1 output partitions 22/03/29 20:58:02 INFO scheduler.DAGScheduler: Final stage: ResultStage 131 (collect at HoodieSparkEngineContext.java:100) 22/03/29 20:58:02 INFO scheduler.DAGScheduler: Parents of final stage: List() 22/03/29 20:58:02 INFO scheduler.DAGScheduler: Missing parents: List()
22/03/29 20:58:02 INFO scheduler.DAGScheduler: Submitting ResultStage 131 (MapPartitionsRDD[235] at map at HoodieSparkEngineContext.java:100), which has no missing parents 22/03/29 20:58:02 INFO memory.MemoryStore: Block broadcast_89 stored as values in memory (estimated size 71.9 KB, free 364.0 MB) 22/03/29 20:58:02 INFO memory.MemoryStore: Block broadcast_89_piece0 stored as bytes in memory (estimated size 26.3 KB, free 364.0 MB) 22/03/29 20:58:02 INFO storage.BlockManagerInfo: Added broadcast_89_piece0 in memory on adhoc-1:38703 (size: 26.3 KB, free: 365.7 MB) 22/03/29 20:58:02 INFO spark.SparkContext: Created broadcast 89 from broadcast at DAGScheduler.scala:1161 22/03/29 20:58:02 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 131 (MapPartitionsRDD[235] at map at HoodieSparkEngineContext.java:100) (first 15 tasks are for partitions Vector(0)) 22/03/29 20:58:02 INFO scheduler.TaskSchedulerImpl: Adding task set 131.0 with 1 tasks 22/03/29 20:58:02 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 131.0 (TID 4081, localhost, executor driver, partition 0, PROCESS_LOCAL, 7803 bytes) 22/03/29 20:58:02 INFO executor.Executor: Running task 0.0 in stage 131.0 (TID 4081) 22/03/29 20:58:02 INFO executor.Executor: Finished task 0.0 in stage 131.0 (TID 4081). 
1167 bytes result sent to driver 22/03/29 20:58:02 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 131.0 (TID 4081) in 17 ms on localhost (executor driver) (1/1) 22/03/29 20:58:02 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 131.0, whose tasks have all completed, from pool 22/03/29 20:58:02 INFO scheduler.DAGScheduler: ResultStage 131 (collect at HoodieSparkEngineContext.java:100) finished in 0.030 s 22/03/29 20:58:02 INFO scheduler.DAGScheduler: Job 63 finished: collect at HoodieSparkEngineContext.java:100, took 0.032364 s 22/03/29 20:58:02 INFO timeline.HoodieActiveTimeline: Loaded instants upto : Option{val=[20220329205734338__commit__COMPLETED]} 22/03/29 20:58:02 INFO view.AbstractTableFileSystemView: Took 0 ms to read 0 instants, 0 replaced file groups 22/03/29 20:58:02 INFO util.ClusteringUtils: Found 0 files in pending clustering operations 22/03/29 20:58:02 INFO view.AbstractTableFileSystemView: addFilesToView: NumFiles=1, NumFileGroups=1, FileGroupsCreationTime=0, StoreTimeTaken=0 22/03/29 20:58:02 INFO hudi.HoodieFileIndex: Refresh table bq_demo_partitioned_cow, spend: 124 ms 22/03/29 20:58:02 INFO table.HoodieTableMetaClient: Loading HoodieTableMetaClient from hdfs://namenode:8020/user/hive/warehouse/bq_demo_partitioned_cow 22/03/29
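Since the failing config is schema-related, it helps to spell out its intent. The following is a minimal, hypothetical sketch (not Hudi's implementation) of what hoodie.datasource.write.drop.partition.columns = 'true' means for the written schema, using the table above: the partition fields are removed from the parquet schema, and the partition value is then carried only by the hive-style path (e.g. `dt=2021-12-25`).

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Illustration only (not Hudi code): with drop.partition.columns enabled, the
// writer projects the partition fields out of the schema written to parquet.
public class DropPartitionColumns {
    static List<String> writtenSchema(List<String> fields, Set<String> partitionCols) {
        return fields.stream()
            .filter(f -> !partitionCols.contains(f))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Fields of bq_demo_partitioned_cow; "dt" is the partition column.
        List<String> fields = List.of("id", "name", "price", "ts", "dt");
        System.out.println(writtenSchema(fields, Set.of("dt"))); // [id, name, price, ts]
    }
}
```

The bug report above is about the insert path failing to handle exactly this mismatch between the table schema and the written parquet schema.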
[jira] [Created] (HUDI-3748) Hudi fails to insert into a partitioned table when the partition column is dropped from the parquet schema
Vinoth Govindarajan created HUDI-3748: - Summary: Hudi fails to insert into a partitioned table when the partition column is dropped from the parquet schema Key: HUDI-3748 URL: https://issues.apache.org/jira/browse/HUDI-3748 Project: Apache Hudi Issue Type: Bug Reporter: Vinoth Govindarajan When you add the config that drops the partition column from the parquet schema (to support BigQuery), Hudi fails to insert with the following error. Steps to reproduce: Log into spark-sql and execute the following commands:
{code:java}
create table bq_demo_partitioned_cow (
  id bigint,
  name string,
  price double,
  ts bigint,
  dt string
) using hudi
partitioned by (dt)
tblproperties (
  type = 'cow',
  primaryKey = 'id',
  preCombineField = 'ts',
  hoodie.datasource.write.drop.partition.columns = 'true'
);

insert into bq_demo_partitioned_cow partition(dt='2021-12-25') values(1, 'a1', 10, current_timestamp());
insert into bq_demo_partitioned_cow partition(dt='2021-12-25') values(2, 'a2', 20, current_timestamp());
{code}
Error:
{code:java}
22/03/29 20:58:02 INFO spark.SparkContext: Starting job: collect at HoodieSparkEngineContext.java:100 22/03/29 20:58:02 INFO scheduler.DAGScheduler: Got job 63 (collect at HoodieSparkEngineContext.java:100) with 1 output partitions 22/03/29 20:58:02 INFO scheduler.DAGScheduler: Final stage: ResultStage 131 (collect at HoodieSparkEngineContext.java:100) 22/03/29 20:58:02 INFO scheduler.DAGScheduler: Parents of final stage: List() 22/03/29 20:58:02 INFO scheduler.DAGScheduler: Missing parents: List() 22/03/29 20:58:02 INFO scheduler.DAGScheduler: Submitting ResultStage 131 (MapPartitionsRDD[235] at map at HoodieSparkEngineContext.java:100), which has no missing parents 22/03/29 20:58:02 INFO memory.MemoryStore: Block broadcast_89 stored as values in memory (estimated size 71.9 KB, free 364.0 MB) 22/03/29 20:58:02 INFO memory.MemoryStore: Block broadcast_89_piece0 stored as bytes in memory (estimated size 26.3 KB, free 364.0 MB) 22/03/29 20:58:02 INFO
storage.BlockManagerInfo: Added broadcast_89_piece0 in memory on adhoc-1:38703 (size: 26.3 KB, free: 365.7 MB) 22/03/29 20:58:02 INFO spark.SparkContext: Created broadcast 89 from broadcast at DAGScheduler.scala:1161 22/03/29 20:58:02 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 131 (MapPartitionsRDD[235] at map at HoodieSparkEngineContext.java:100) (first 15 tasks are for partitions Vector(0)) 22/03/29 20:58:02 INFO scheduler.TaskSchedulerImpl: Adding task set 131.0 with 1 tasks 22/03/29 20:58:02 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 131.0 (TID 4081, localhost, executor driver, partition 0, PROCESS_LOCAL, 7803 bytes) 22/03/29 20:58:02 INFO executor.Executor: Running task 0.0 in stage 131.0 (TID 4081) 22/03/29 20:58:02 INFO executor.Executor: Finished task 0.0 in stage 131.0 (TID 4081). 1167 bytes result sent to driver 22/03/29 20:58:02 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 131.0 (TID 4081) in 17 ms on localhost (executor driver) (1/1) 22/03/29 20:58:02 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 131.0, whose tasks have all completed, from pool 22/03/29 20:58:02 INFO scheduler.DAGScheduler: ResultStage 131 (collect at HoodieSparkEngineContext.java:100) finished in 0.030 s 22/03/29 20:58:02 INFO scheduler.DAGScheduler: Job 63 finished: collect at HoodieSparkEngineContext.java:100, took 0.032364 s 22/03/29 20:58:02 INFO timeline.HoodieActiveTimeline: Loaded instants upto : Option{val=[20220329205734338__commit__COMPLETED]} 22/03/29 20:58:02 INFO view.AbstractTableFileSystemView: Took 0 ms to read 0 instants, 0 replaced file groups 22/03/29 20:58:02 INFO util.ClusteringUtils: Found 0 files in pending clustering operations 22/03/29 20:58:02 INFO view.AbstractTableFileSystemView: addFilesToView: NumFiles=1, NumFileGroups=1, FileGroupsCreationTime=0, StoreTimeTaken=0 22/03/29 20:58:02 INFO hudi.HoodieFileIndex: Refresh table bq_demo_partitioned_cow, spend: 124 ms 22/03/29 20:58:02 INFO 
table.HoodieTableMetaClient: Loading HoodieTableMetaClient from hdfs://namenode:8020/user/hive/warehouse/bq_demo_partitioned_cow 22/03/29 20:58:02 INFO table.HoodieTableConfig: Loading table properties from hdfs://namenode:8020/user/hive/warehouse/bq_demo_partitioned_cow/.hoodie/hoodie.properties 22/03/29 20:58:02 INFO table.HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from hdfs://namenode:8020/user/hive/warehouse/bq_demo_partitioned_cow 22/03/29 20:58:02 INFO timeline.HoodieActiveTimeline:
[jira] [Updated] (HUDI-3020) Create a utility to generate manifest file
[ https://issues.apache.org/jira/browse/HUDI-3020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan updated HUDI-3020: -- Status: Patch Available (was: In Progress) > Create a utility to generate manifest file > -- > > Key: HUDI-3020 > URL: https://issues.apache.org/jira/browse/HUDI-3020 > Project: Apache Hudi > Issue Type: Task >Reporter: Vinoth Govindarajan >Assignee: Joyan Sil >Priority: Major > Labels: pull-request-available > > Create a utility to generate a manifest file that contains the latest snapshot > files for each partition, in CSV format with a single column, filename. > > This is the first step towards integrating Hudi with Snowflake. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-3020) Create a utility to generate manifest file
[ https://issues.apache.org/jira/browse/HUDI-3020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan updated HUDI-3020: -- Epic Link: HUDI-2438 (was: HUDI-2832) > Create a utility to generate manifest file > -- > > Key: HUDI-3020 > URL: https://issues.apache.org/jira/browse/HUDI-3020 > Project: Apache Hudi > Issue Type: Task >Reporter: Vinoth Govindarajan >Assignee: Joyan Sil >Priority: Major > Labels: pull-request-available > > Create a utility to generate a manifest file that contains the latest snapshot > files for each partition, in CSV format with a single column, filename. > > This is the first step towards integrating Hudi with Snowflake. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-3357) Implement the BigQuerySyncTool
[ https://issues.apache.org/jira/browse/HUDI-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan updated HUDI-3357: -- Status: Patch Available (was: In Progress) > Implement the BigQuerySyncTool > -- > > Key: HUDI-3357 > URL: https://issues.apache.org/jira/browse/HUDI-3357 > Project: Apache Hudi > Issue Type: New Feature > Components: meta-sync, reader-core >Reporter: Vinoth Govindarajan >Assignee: Vinoth Govindarajan >Priority: Blocker > Labels: pull-request-available > Fix For: 0.11.0 > > > Implement the BigQuerySyncTool, similar to HiveSyncTool, which syncs the Hudi > table to a BigQuery table. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (HUDI-3357) Implement the BigQuerySyncTool
[ https://issues.apache.org/jira/browse/HUDI-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512104#comment-17512104 ] Vinoth Govindarajan commented on HUDI-3357: --- [~xushiyan] - Thanks for the reminder and for being flexible enough to accommodate this work as part of the 0.11.0 release; here is the MVP version of the PR: [https://github.com/apache/hudi/pull/5125] I will add more unit tests in a few days. > Implement the BigQuerySyncTool > -- > > Key: HUDI-3357 > URL: https://issues.apache.org/jira/browse/HUDI-3357 > Project: Apache Hudi > Issue Type: New Feature > Components: meta-sync, reader-core >Reporter: Vinoth Govindarajan >Assignee: Vinoth Govindarajan >Priority: Blocker > Fix For: 0.11.0 > > > Implement the BigQuerySyncTool, similar to HiveSyncTool, which syncs the Hudi > table to a BigQuery table. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (HUDI-3020) Create a utility to generate manifest file
[ https://issues.apache.org/jira/browse/HUDI-3020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan reassigned HUDI-3020: - Assignee: Prashant Wason (was: Vinoth Govindarajan) > Create a utility to generate manifest file > -- > > Key: HUDI-3020 > URL: https://issues.apache.org/jira/browse/HUDI-3020 > Project: Apache Hudi > Issue Type: Task >Reporter: Vinoth Govindarajan >Assignee: Prashant Wason >Priority: Major > > Create a utility to generate a manifest file that contains the latest snapshot > files for each partition, in CSV format with a single column, filename. > > This is the first step towards integrating Hudi with Snowflake. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (HUDI-3570) [Umbrella] [RFC-TBD] Integrate Hudi with Dremio
Vinoth Govindarajan created HUDI-3570: - Summary: [Umbrella] [RFC-TBD] Integrate Hudi with Dremio Key: HUDI-3570 URL: https://issues.apache.org/jira/browse/HUDI-3570 Project: Apache Hudi Issue Type: Epic Components: connectors Reporter: Vinoth Govindarajan Assignee: Vinoth Govindarajan In many large enterprises, especially banks, there is strong demand for BI: leaders look at BI dashboards every day. As data volumes grow, BI queries slow down, which drives the need to accelerate queries with pre-computed data. Dremio can meet this need by analyzing historical SQL and automatically maintaining accelerated copies of the data. Hudi has many features that Iceberg doesn't, so we tend to use Hudi, but Dremio does not currently support it. Support issue on GitHub: https://github.com/apache/hudi/issues/4627 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (HUDI-3371) Add Dremio integration support to hudi
[ https://issues.apache.org/jira/browse/HUDI-3371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan reassigned HUDI-3371: - Assignee: Vinoth Govindarajan > Add Dremio integration support to hudi > -- > > Key: HUDI-3371 > URL: https://issues.apache.org/jira/browse/HUDI-3371 > Project: Apache Hudi > Issue Type: Task >Reporter: sivabalan narayanan >Assignee: Vinoth Govindarajan >Priority: Major > > For BI reports and dashboards, Dremio provides accelerated queries, so adding > Dremio integration with Hudi will be an added value for Hudi users in general. > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (HUDI-3357) Implement the BigQuerySyncTool
Vinoth Govindarajan created HUDI-3357: - Summary: Implement the BigQuerySyncTool Key: HUDI-3357 URL: https://issues.apache.org/jira/browse/HUDI-3357 Project: Apache Hudi Issue Type: New Feature Components: hive-sync Reporter: Vinoth Govindarajan Implement the BigQuerySyncTool, similar to HiveSyncTool, which syncs the Hudi table to a BigQuery table. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (HUDI-3357) Implement the BigQuerySyncTool
[ https://issues.apache.org/jira/browse/HUDI-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan reassigned HUDI-3357: - Assignee: Vinoth Govindarajan > Implement the BigQuerySyncTool > -- > > Key: HUDI-3357 > URL: https://issues.apache.org/jira/browse/HUDI-3357 > Project: Apache Hudi > Issue Type: New Feature > Components: hive-sync >Reporter: Vinoth Govindarajan >Assignee: Vinoth Govindarajan >Priority: Major > > Implement the BigQuerySyncTool, similar to HiveSyncTool, which syncs the Hudi > table to a BigQuery table. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (HUDI-3290) Make the .hoodie-partition-metadata an empty parquet file
[ https://issues.apache.org/jira/browse/HUDI-3290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan reassigned HUDI-3290: - Assignee: Prashant Wason (was: Vinoth Govindarajan) > Make the .hoodie-partition-metadata an empty parquet file > - > > Key: HUDI-3290 > URL: https://issues.apache.org/jira/browse/HUDI-3290 > Project: Apache Hudi > Issue Type: New Feature > Components: metadata >Reporter: Vinoth Govindarajan >Assignee: Prashant Wason >Priority: Major > > For BigQuery and Snowflake integration, we cannot create external > tables when the partition folder contains the non-parquet file > `.hoodie-partition-metadata`. > This file is important for locating the .hoodie folder from within > the partition folder, and the long-term solution is to get rid of it. But as a > short-term solution, if we convert it to an empty parquet file and > add the necessary depth information in the footer, it will pass the > BigQuery/Snowflake external table validation and allow us to create an > external parquet table on top of the Hudi folder structure. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (HUDI-3290) Make the .hoodie-partition-metadata an empty parquet file
Vinoth Govindarajan created HUDI-3290: - Summary: Make the .hoodie-partition-metadata an empty parquet file Key: HUDI-3290 URL: https://issues.apache.org/jira/browse/HUDI-3290 Project: Apache Hudi Issue Type: New Feature Components: metadata Reporter: Vinoth Govindarajan Assignee: Vinoth Govindarajan For BigQuery and Snowflake integration, we cannot create external tables when the partition folder contains the non-parquet file `.hoodie-partition-metadata`. This file is important for locating the .hoodie folder from within the partition folder, and the long-term solution is to get rid of it. But as a short-term solution, if we convert it to an empty parquet file and add the necessary depth information in the footer, it will pass the BigQuery/Snowflake external table validation and allow us to create an external parquet table on top of the Hudi folder structure. -- This message was sent by Atlassian Jira (v8.20.1#820001)
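The validation failure described in this ticket can be illustrated with a hypothetical listing filter: an external-table loader effectively scans every object under the table prefix, so anything that is not readable parquet either has to be excluded or, per this ticket, turned into a valid (empty) parquet file. A minimal sketch, assuming hidden files are dot-prefixed:

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustration only: which files in a Hudi partition folder an external table
// could safely scan. ".hoodie_partition_metadata" and ".crc" sidecar files are
// the ones that trip up BigQuery/Snowflake external table creation.
public class PartitionListing {
    static List<String> parquetOnly(List<String> partitionFiles) {
        return partitionFiles.stream()
            .filter(f -> f.endsWith(".parquet") && !f.startsWith("."))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> files = List.of(
            ".hoodie_partition_metadata",                 // metadata marker, not parquet
            ".abc-0_0-21-1605_20210101.parquet.crc",      // checksum sidecar
            "abc-0_0-21-1605_20210101.parquet");          // actual data file
        System.out.println(parquetOnly(files)); // [abc-0_0-21-1605_20210101.parquet]
    }
}
```

The proposal in the ticket goes the other way: instead of filtering, make the marker file itself a valid empty parquet file so no filtering is needed.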
[jira] [Comment Edited] (HUDI-2438) [Umbrella] [RFC-34] Implement BigQuerySyncTool for BigQuery Sync
[ https://issues.apache.org/jira/browse/HUDI-2438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17465859#comment-17465859 ] Vinoth Govindarajan edited comment on HUDI-2438 at 12/27/21, 8:01 PM: -- Hi [~qiao.xu] - Good news! I found a way to integrate Hudi with BigQuery; here is the general idea: * Let's say you have Hudi table data on Google Cloud Storage (GCS).
{code:java}
create table dwh.bq_demo_partitioned_cow (
  id bigint,
  name string,
  price double,
  ts bigint,
  dt string
) using hudi
partitioned by (dt)
options (
  type = 'cow',
  primaryKey = 'id',
  preCombineField = 'ts',
  hoodie.datasource.write.drop.partition.columns = 'true'
)
location 'gs://hudi_datasets/bq_demo_partitioned_cow/';
{code}
BQ doesn't accept the partition column in the parquet schema, so we need to drop the partition columns from the schema by enabling this flag:
{code:java}
hoodie.datasource.write.drop.partition.columns = 'true'
{code}
* Generate a manifest file for the Hudi table that lists the latest snapshot parquet file names, in CSV format with a single column, the file name. The manifest file should live in the .hoodie metadata folder (`gs://bucket_name/table_name/.hoodie/manifest/latest_snapshot_files.csv`).
{code:java}
// this command is coming soon.
GENERATE symlink_format_manifest FOR TABLE dwh.bq_demo_partitioned_cow;
{code}
* Create a BQ table named `table_name_manifest` with a single column, filename, at this location: `gs://bucket_name/table_name/.hoodie/manifest/latest_snapshot_files.csv`.
{code:java}
CREATE EXTERNAL TABLE `golden-union-336019.dwh.bq_demo_partitioned_cow_manifest` (
  filename STRING
)
OPTIONS(
  format="CSV",
  uris=["gs://hudi_datasets/bq_demo_partitioned_cow/.hoodie/manifest/latest_snapshot_files.csv"]
);
{code}
* Create another BQ table named `table_name_history` at the location `gs://bucket_name/table_name`. Don't use this table to query the data; it will have duplicate records, since it scans all the versions of the parquet files in the table/partition folders.
{code:java}
CREATE EXTERNAL TABLE `golden-union-336019.dwh.bq_demo_partitioned_cow_history`
WITH PARTITION COLUMNS
OPTIONS(
  ignore_unknown_values=true,
  format="PARQUET",
  hive_partition_uri_prefix="gs://hudi_datasets/bq_demo_partitioned_cow/",
  uris=["gs://hudi_snowflake/bq_demo_partitioned_cow/dt=*"]
);
{code}
* Create a BQ view named `table_name` with this query:
{code:java}
CREATE VIEW `golden-union-336019.dwh.bq_demo_partitioned_cow` AS
SELECT * FROM `golden-union-336019.dwh.bq_demo_partitioned_cow_history`
WHERE _hoodie_file_name IN (
  SELECT filename FROM `golden-union-336019.dwh.bq_demo_partitioned_cow_manifest`);
{code}
* The view you just created has the data from the Hudi table without any duplicates; use it to query the data. To make this model work, we need a way to generate the manifest file with the latest snapshot files; I will create a PR for that soon.
One final step: Hudi generates multiple non-parquet files in the table/partition location (crc files + .hoodie_partition_metadata):
{code:java}
..hoodie_partition_metadata.crc
.a4949325-81ca-4d6b-8ce5-5b41bea62b36-0_0-21-1605_20211227090948.parquet.crc
.a4949325-81ca-4d6b-8ce5-5b41bea62b36-0_0-53-3220_20211227091157.parquet.crc
.hoodie_partition_metadata
a4949325-81ca-4d6b-8ce5-5b41bea62b36-0_0-21-1605_20211227090948.parquet
a4949325-81ca-4d6b-8ce5-5b41bea62b36-0_0-53-3220_20211227091157.parquet
{code}
We need to remove these non-parquet files to make the BQ integration work; I will talk to the Hudi core team about how to get rid of these files from the partition location.
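The dedup semantics of the view described in the comment above can be stated outside BigQuery too. The sketch below is a hypothetical, minimal re-statement of the view's WHERE clause: the "history" table sees every parquet file version (hence duplicate rows for the same record), and the view keeps only rows whose `_hoodie_file_name` appears in the manifest of latest snapshot files.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Mirrors the BQ view semantics: filter "history" rows by the manifest.
// Each "row" here is a hypothetical {recordKey, _hoodie_file_name} pair.
public class SnapshotView {
    static List<String[]> snapshot(List<String[]> history, Set<String> manifest) {
        return history.stream()
            .filter(r -> manifest.contains(r[1]))   // r[1] is _hoodie_file_name
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String[]> history = List.of(
            new String[]{"1", "v1.parquet"},   // stale file version -> duplicate row
            new String[]{"1", "v2.parquet"});  // latest snapshot file
        // Only the row from the latest snapshot file survives.
        System.out.println(snapshot(history, Set.of("v2.parquet")).size()); // 1
    }
}
```

This is why the manifest must be regenerated after every commit: the set of latest snapshot files is what defines the deduplicated view.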
[jira] [Commented] (HUDI-2438) [Umbrella] [RFC-34] Implement BigQuerySyncTool for BigQuery Sync
[ https://issues.apache.org/jira/browse/HUDI-2438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17465859#comment-17465859 ] Vinoth Govindarajan commented on HUDI-2438: --- Hi [~qiao.xu] - Good news!! I found a way to integrate Hudi with BigQuery, this is a general idea: # Let's say you have a Hudi table data on google storage (GCS). {code:java} create table dwh.bq_demo_partitioned_cow ( id bigint, name string, price double, ts bigint, dt string ) using hudi partitioned by (dt) options ( type = 'cow', primaryKey = 'id', preCombineField = 'ts', hoodie.datasource.write.drop.partition.columns = 'true' ) location 'gs://hudi_datasets/bq_demo_partitioned_cow/';{code} BQ doesn't accept the partition column in the parquet schema, hence we need to drop the partition columns from the schema by enabling this flag: {code:java} hoodie.datasource.write.drop.partition.columns = 'true'{code} # Generate a manifest file for the Hudi table which has the list of the latest snapshot parquet file names in a CSV format with only one column the file name. The location of the manifest file should be on the .hoodie metadata folder (`gs://bucket_name/table_name/.hoodie/manifest/latest_snapshot_files.csv`) {code:java} // this command is coming soon. GENERATE symlink_format_manifest FOR TABLE dwh.bq_demo_partitioned_cow;{code} # Create a BQ table named `table_name_manifest` with only one column filename with this location `gs://bucket_name/table_name/.hoodie/manifest/latest_snapshot_files.csv`. 
{code:java} CREATE EXTERNAL TABLE `golden-union-336019.dwh.bq_demo_partitioned_cow_manifest` ( filename STRING ) OPTIONS( format="CSV", uris=["gs://hudi_datasets/bq_demo_partitioned_cow/.hoodie/manifest/latest_snapshot_files.csv"] );{code} # Create another BQ table named `table_name_history` with the location `gs://bucket_name/table_name`. Don't query this table directly; it will have duplicate records, since it scans all versions of the Parquet files in the table/partition folders. {code:java} CREATE EXTERNAL TABLE `golden-union-336019.dwh.bq_demo_partitioned_cow_history` WITH PARTITION COLUMNS OPTIONS( ignore_unknown_values=true, format="PARQUET", hive_partition_uri_prefix="gs://hudi_datasets/bq_demo_partitioned_cow/", uris=["gs://hudi_datasets/bq_demo_partitioned_cow/dt=*"] );{code} # Create a BQ view named `table_name` with this query: {code:java} CREATE VIEW `golden-union-336019.dwh.bq_demo_partitioned_cow` AS SELECT * FROM `golden-union-336019.dwh.bq_demo_partitioned_cow_history` WHERE _hoodie_file_name IN ( SELECT filename FROM `golden-union-336019.dwh.bq_demo_partitioned_cow_manifest`);{code} # The view created in the last step exposes the Hudi table data without any duplicates; use it to query the table. To make this model work, we need a way to generate the manifest file with the latest snapshot files; I will create a PR for that soon. 
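The manifest-generation step could be sketched as follows. This is a minimal sketch, assuming the Hudi copy-on-write base-file naming convention `<fileId>_<writeToken>_<instantTime>.parquet` visible in the partition listings in this thread; the function name `latest_snapshot_files` is hypothetical and not part of any Hudi utility.

```python
# Minimal sketch: pick the latest snapshot Parquet file per Hudi file group.
# Assumes COW base-file names of the form <fileId>_<writeToken>_<instantTime>.parquet
# (as in the partition listings shown in this thread); the function name is hypothetical.
def latest_snapshot_files(filenames):
    latest = {}  # file group id -> (commit instant, file name)
    for name in filenames:
        # skip dot-prefixed files (.crc, .hoodie_partition_metadata) and non-parquet files
        if name.startswith(".") or not name.endswith(".parquet"):
            continue
        stem = name[: -len(".parquet")]
        file_id, _write_token, instant = stem.rsplit("_", 2)
        # keep only the file with the highest commit instant per file group
        if file_id not in latest or instant > latest[file_id][0]:
            latest[file_id] = (instant, name)
    return sorted(name for _, name in latest.values())
```

The resulting list would then be written out as the one-column CSV manifest that the BigQuery external table reads.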
One final step: Hudi generates multiple non-parquet files in the table/partition location (.crc files and .hoodie_partition_metadata): {code:java} ..hoodie_partition_metadata.crc .a4949325-81ca-4d6b-8ce5-5b41bea62b36-0_0-21-1605_20211227090948.parquet.crc .a4949325-81ca-4d6b-8ce5-5b41bea62b36-0_0-53-3220_20211227091157.parquet.crc .hoodie_partition_metadata a4949325-81ca-4d6b-8ce5-5b41bea62b36-0_0-21-1605_20211227090948.parquet a4949325-81ca-4d6b-8ce5-5b41bea62b36-0_0-53-3220_20211227091157.parquet{code} We need to remove these non-parquet files to make the BQ integration work; I will talk to the Hudi core team about how to remove these files from the partition location. > [Umbrella] [RFC-34] Implement BigQuerySyncTool for BigQuery Sync > > > Key: HUDI-2438 > URL: https://issues.apache.org/jira/browse/HUDI-2438 > Project: Apache Hudi > Issue Type: New Feature > Components: Common Core >Reporter: Vinoth Govindarajan >Assignee: Vinoth Govindarajan >Priority: Major > Labels: BigQuery, Integration > Fix For: 0.11.0 > > > BigQuery is Google Cloud's fully managed, petabyte-scale, and cost-effective > analytics data warehouse that lets you run analytics over vast amounts of > data in near real-time. BigQuery currently [doesn’t > support|https://cloud.google.com/bigquery/external-data-cloud-storage] Apache > Hudi file format, but it has support for the Parquet file format. The > proposal is to implement a BigQuerySync similar to HiveSync to sync the Hudi > table as
[jira] [Created] (HUDI-3091) Make simple index the default hoodie.index.type
Vinoth Govindarajan created HUDI-3091: - Summary: Make simple index the default hoodie.index.type Key: HUDI-3091 URL: https://issues.apache.org/jira/browse/HUDI-3091 Project: Apache Hudi Issue Type: New Feature Components: Index Reporter: Vinoth Govindarajan When performing upserts on derived datasets, we often run into OOM issues with the bloom filter, hence we changed all the dataset index types to simple to resolve the issue. Some of the tables were non-partitioned tables, for which the bloom index is not the right choice. I'm proposing to make the simple index the default value; on a case-by-case basis, folks can choose the bloom index for the additional performance gains bloom filters offer. I agree that read/write performance may be sub-optimal with the simple index, but for regular use cases it won't break any ingestion/derived jobs. -- This message was sent by Atlassian Jira (v8.20.1#820001)
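As an illustration of the proposal, a writer could opt into either index type through `hoodie.index.type`. The sketch below is hypothetical (the helper function, table name, and field names are made up for illustration); `hoodie.index.type` with values `SIMPLE` and `BLOOM` is the real configuration knob being discussed.

```python
# Illustrative sketch of Hudi write options (helper name, table, and field
# names are hypothetical). hoodie.index.type is the config this issue proposes
# to default to SIMPLE, with BLOOM opted into per table where its file pruning
# pays off.
def hudi_write_options(table, index_type="SIMPLE"):
    return {
        "hoodie.table.name": table,
        "hoodie.datasource.write.recordkey.field": "id",
        "hoodie.datasource.write.precombine.field": "ts",
        # SIMPLE avoids the bloom-filter OOMs described above on upsert-heavy jobs
        "hoodie.index.type": index_type,
    }
```

A table that benefits from bloom pruning would pass `index_type="BLOOM"` explicitly instead of relying on the default.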
[jira] [Created] (HUDI-3020) Create a utility to generate manifest file
Vinoth Govindarajan created HUDI-3020: - Summary: Create a utility to generate manifest file Key: HUDI-3020 URL: https://issues.apache.org/jira/browse/HUDI-3020 Project: Apache Hudi Issue Type: Sub-task Reporter: Vinoth Govindarajan Assignee: Vinoth Govindarajan Create a utility to generate a manifest file that contains the latest snapshot files for each partition, in CSV format with a single column, filename. This is the first step towards integrating Hudi with Snowflake. -- This message was sent by Atlassian Jira (v8.20.1#820001)
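The one-column CSV output described here could be produced with the standard library. This is a sketch under stated assumptions: the function name is hypothetical, and it writes to a local path, whereas the real utility would presumably target cloud storage under `.hoodie/manifest/`.

```python
import csv

# Sketch: write the header-less, one-column ("filename") manifest described
# in this issue. The function name and local-path target are hypothetical;
# the actual utility would write to cloud storage under .hoodie/manifest/.
def write_manifest(filenames, path):
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for name in filenames:
            writer.writerow([name])  # one file name per row
```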
[jira] [Updated] (HUDI-2832) [Umbrella] [RFC-40] Implement SnowflakeSyncTool to support Hudi to Snowflake Integration
[ https://issues.apache.org/jira/browse/HUDI-2832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan updated HUDI-2832: -- Status: In Progress (was: Open) > [Umbrella] [RFC-40] Implement SnowflakeSyncTool to support Hudi to Snowflake > Integration > > > Key: HUDI-2832 > URL: https://issues.apache.org/jira/browse/HUDI-2832 > Project: Apache Hudi > Issue Type: New Feature > Components: Common Core >Reporter: Vinoth Govindarajan >Assignee: Vinoth Govindarajan >Priority: Major > Labels: BigQuery, Integration, pull-request-available > Fix For: 0.11.0 > > > Snowflake is a fully managed service that’s simple to use but can power a > near-unlimited number of concurrent workloads. Snowflake is a solution for > data warehousing, data lakes, data engineering, data science, data > application development, and securely sharing and consuming shared data. > Snowflake [doesn’t > support|https://docs.snowflake.com/en/sql-reference/sql/alter-file-format.html] > Apache Hudi file format yet, but it has support for Parquet, ORC, and Delta > file format. This proposal is to implement a SnowflakeSync similar to > HiveSync to sync the Hudi table as the Snowflake External Parquet table so > that users can query the Hudi tables using Snowflake. Many users have > expressed interest in Hudi and other support channels asking to integrate > Hudi with Snowflake, this will unlock new use cases for Hudi. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-2319) Integrate hudi with dbt (data build tool)
[ https://issues.apache.org/jira/browse/HUDI-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan updated HUDI-2319: -- Fix Version/s: 0.10.0 (was: 0.11.0) > Integrate hudi with dbt (data build tool) > - > > Key: HUDI-2319 > URL: https://issues.apache.org/jira/browse/HUDI-2319 > Project: Apache Hudi > Issue Type: New Feature > Components: Usability >Reporter: Vinoth Govindarajan >Assignee: Vinoth Govindarajan >Priority: Critical > Labels: integration > Fix For: 0.10.0 > > > dbt (data build tool) enables analytics engineers to transform data in their > warehouses by simply writing select statements. dbt handles turning these > select statements into tables and views. > dbt does the {{T}} in {{ELT}} (Extract, Load, Transform) processes – it > doesn’t extract or load data, but it’s extremely good at transforming data > that’s already loaded into your warehouse. > > dbt currently supports only the Delta file format; a few folks in the dbt > community are asking for Hudi integration. Moreover, adding dbt integration will > make it easier to create derived datasets for any data engineer/analyst. > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-2319) Integrate hudi with dbt (data build tool)
[ https://issues.apache.org/jira/browse/HUDI-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan updated HUDI-2319: -- Status: Resolved (was: Patch Available) > Integrate hudi with dbt (data build tool) > - > > Key: HUDI-2319 > URL: https://issues.apache.org/jira/browse/HUDI-2319 > Project: Apache Hudi > Issue Type: New Feature > Components: Usability >Reporter: Vinoth Govindarajan >Assignee: Vinoth Govindarajan >Priority: Critical > Labels: integration > Fix For: 0.10.0 > > > dbt (data build tool) enables analytics engineers to transform data in their > warehouses by simply writing select statements. dbt handles turning these > select statements into tables and views. > dbt does the {{T}} in {{ELT}} (Extract, Load, Transform) processes – it > doesn’t extract or load data, but it’s extremely good at transforming data > that’s already loaded into your warehouse. > > dbt currently supports only the Delta file format; a few folks in the dbt > community are asking for Hudi integration. Moreover, adding dbt integration will > make it easier to create derived datasets for any data engineer/analyst. > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (HUDI-2319) Integrate hudi with dbt (data build tool)
[ https://issues.apache.org/jira/browse/HUDI-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17454963#comment-17454963 ] Vinoth Govindarajan commented on HUDI-2319: --- Hudi support has been added to dbt; here is the dbt-spark [PR|https://github.com/dbt-labs/dbt-spark/pull/210]. > Integrate hudi with dbt (data build tool) > - > > Key: HUDI-2319 > URL: https://issues.apache.org/jira/browse/HUDI-2319 > Project: Apache Hudi > Issue Type: New Feature > Components: Usability >Reporter: Vinoth Govindarajan >Assignee: Vinoth Govindarajan >Priority: Critical > Labels: integration > Fix For: 0.11.0 > > > dbt (data build tool) enables analytics engineers to transform data in their > warehouses by simply writing select statements. dbt handles turning these > select statements into tables and views. > dbt does the {{T}} in {{ELT}} (Extract, Load, Transform) processes – it > doesn’t extract or load data, but it’s extremely good at transforming data > that’s already loaded into your warehouse. > > dbt currently supports only the Delta file format; a few folks in the dbt > community are asking for Hudi integration. Moreover, adding dbt integration will > make it easier to create derived datasets for any data engineer/analyst. > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-2832) [Umbrella] [RFC-40] Implement SnowflakeSyncTool for Hudi to Snowflake Integration
[ https://issues.apache.org/jira/browse/HUDI-2832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan updated HUDI-2832: -- Description: Snowflake is a fully managed service that’s simple to use but can power a near-unlimited number of concurrent workloads. Snowflake is a solution for data warehousing, data lakes, data engineering, data science, data application development, and securely sharing and consuming shared data. Snowflake [doesn’t support|https://docs.snowflake.com/en/sql-reference/sql/alter-file-format.html] Apache Hudi file format yet, but it has support for Parquet, ORC, and Delta file format. This proposal is to implement a SnowflakeSync similar to HiveSync to sync the Hudi table as the Snowflake External Parquet table so that users can query the Hudi tables using Snowflake. Many users have expressed interest in Hudi and other support channels asking to integrate Hudi with Snowflake, this will unlock new use cases for Hudi. (was: BigQuery is Google Cloud's fully managed, petabyte-scale, and cost-effective analytics data warehouse that lets you run analytics over vast amounts of data in near real-time. BigQuery currently [doesn’t support|https://cloud.google.com/bigquery/external-data-cloud-storage] Apache Hudi file format, but it has support for the Parquet file format. The proposal is to implement a BigQuerySync similar to HiveSync to sync the Hudi table as the BigQuery External Parquet table so that users can query the Hudi tables using BigQuery. Uber is already syncing some of its Hudi tables to BigQuery data mart this will help them to write, sync, and query. 
More details are in RFC-34: [https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=188745980]) > [Umbrella] [RFC-40] Implement SnowflakeSyncTool for Hudi to Snowflake > Integration > - > > Key: HUDI-2832 > URL: https://issues.apache.org/jira/browse/HUDI-2832 > Project: Apache Hudi > Issue Type: New Feature > Components: Common Core >Reporter: Vinoth Govindarajan >Assignee: Vinoth Govindarajan >Priority: Major > Labels: BigQuery, Integration > Fix For: 0.11.0 > > > Snowflake is a fully managed service that’s simple to use but can power a > near-unlimited number of concurrent workloads. Snowflake is a solution for > data warehousing, data lakes, data engineering, data science, data > application development, and securely sharing and consuming shared data. > Snowflake [doesn’t > support|https://docs.snowflake.com/en/sql-reference/sql/alter-file-format.html] > Apache Hudi file format yet, but it has support for Parquet, ORC, and Delta > file format. This proposal is to implement a SnowflakeSync similar to > HiveSync to sync the Hudi table as the Snowflake External Parquet table so > that users can query the Hudi tables using Snowflake. Many users have > expressed interest in Hudi and other support channels asking to integrate > Hudi with Snowflake, this will unlock new use cases for Hudi. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-2832) [Umbrella] [RFC-40] Implement SnowflakeSyncTool to support Hudi to Snowflake Integration
[ https://issues.apache.org/jira/browse/HUDI-2832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan updated HUDI-2832: -- Summary: [Umbrella] [RFC-40] Implement SnowflakeSyncTool to support Hudi to Snowflake Integration (was: [Umbrella] [RFC-40] Implement SnowflakeSyncTool for Hudi to Snowflake Integration) > [Umbrella] [RFC-40] Implement SnowflakeSyncTool to support Hudi to Snowflake > Integration > > > Key: HUDI-2832 > URL: https://issues.apache.org/jira/browse/HUDI-2832 > Project: Apache Hudi > Issue Type: New Feature > Components: Common Core >Reporter: Vinoth Govindarajan >Assignee: Vinoth Govindarajan >Priority: Major > Labels: BigQuery, Integration > Fix For: 0.11.0 > > > Snowflake is a fully managed service that’s simple to use but can power a > near-unlimited number of concurrent workloads. Snowflake is a solution for > data warehousing, data lakes, data engineering, data science, data > application development, and securely sharing and consuming shared data. > Snowflake [doesn’t > support|https://docs.snowflake.com/en/sql-reference/sql/alter-file-format.html] > Apache Hudi file format yet, but it has support for Parquet, ORC, and Delta > file format. This proposal is to implement a SnowflakeSync similar to > HiveSync to sync the Hudi table as the Snowflake External Parquet table so > that users can query the Hudi tables using Snowflake. Many users have > expressed interest in Hudi and other support channels asking to integrate > Hudi with Snowflake, this will unlock new use cases for Hudi. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (HUDI-2832) [Umbrella] [RFC-40] Implement SnowflakeSyncTool for Hudi to Snowflake Integration
Vinoth Govindarajan created HUDI-2832: - Summary: [Umbrella] [RFC-40] Implement SnowflakeSyncTool for Hudi to Snowflake Integration Key: HUDI-2832 URL: https://issues.apache.org/jira/browse/HUDI-2832 Project: Apache Hudi Issue Type: New Feature Components: Common Core Reporter: Vinoth Govindarajan Assignee: Vinoth Govindarajan Fix For: 0.11.0 BigQuery is Google Cloud's fully managed, petabyte-scale, and cost-effective analytics data warehouse that lets you run analytics over vast amounts of data in near real-time. BigQuery currently [doesn’t support|https://cloud.google.com/bigquery/external-data-cloud-storage] Apache Hudi file format, but it has support for the Parquet file format. The proposal is to implement a BigQuerySync similar to HiveSync to sync the Hudi table as the BigQuery External Parquet table so that users can query the Hudi tables using BigQuery. Uber is already syncing some of its Hudi tables to BigQuery data mart this will help them to write, sync, and query. More details are in RFC-34: [https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=188745980] -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (HUDI-2438) [Umbrella] [RFC-34] Implement BigQuerySyncTool for BigQuery Sync
[ https://issues.apache.org/jira/browse/HUDI-2438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17447608#comment-17447608 ] Vinoth Govindarajan commented on HUDI-2438: --- Hi [~qiao.xu]! After investigation, I've found that BigQuery doesn't support any kind of manifest file for determining which files to read, which makes this implementation very difficult. The only option now is to select the correct set of parquet files for each and every commit and sync only those files to BigQuery GFS, which would break snapshot isolation and the other benefits of Hudi time travel. I'm focusing on other initiatives like the dbt and Snowflake integrations; I'll look into this after completing both of them. > [Umbrella] [RFC-34] Implement BigQuerySyncTool for BigQuery Sync > > > Key: HUDI-2438 > URL: https://issues.apache.org/jira/browse/HUDI-2438 > Project: Apache Hudi > Issue Type: New Feature > Components: Common Core >Reporter: Vinoth Govindarajan >Assignee: Vinoth Govindarajan >Priority: Major > Labels: BigQuery, Integration > Fix For: 0.11.0 > > > BigQuery is Google Cloud's fully managed, petabyte-scale, and cost-effective > analytics data warehouse that lets you run analytics over vast amounts of > data in near real-time. BigQuery currently [doesn’t > support|https://cloud.google.com/bigquery/external-data-cloud-storage] Apache > Hudi file format, but it has support for the Parquet file format. The > proposal is to implement a BigQuerySync similar to HiveSync to sync the Hudi > table as the BigQuery External Parquet table so that users can query the Hudi > tables using BigQuery. Uber is already syncing some of its Hudi tables to > BigQuery data mart this will help them to write, sync, and query. > > More details are in RFC-34: > [https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=188745980] -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (HUDI-2681) Make hoodie record_key and preCombine_key optional
[ https://issues.apache.org/jira/browse/HUDI-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan reassigned HUDI-2681: - Assignee: sivabalan narayanan > Make hoodie record_key and preCombine_key optional > -- > > Key: HUDI-2681 > URL: https://issues.apache.org/jira/browse/HUDI-2681 > Project: Apache Hudi > Issue Type: New Feature > Components: Common Core >Reporter: Vinoth Govindarajan >Assignee: sivabalan narayanan >Priority: Major > Labels: 0.10_blocker > > At present, Hudi needs a record key and a preCombine key to create a Hudi > dataset, which puts a restriction on the kinds of datasets we can create > using Hudi. > > In order to increase the adoption of the Hudi file format across all kinds of > derived datasets, similar to Parquet/ORC, we need to offer flexibility to > users. I understand that the record key is used for the upsert primitive and we need > the preCombine key to break ties and deduplicate, but there are event data and > other datasets without any primary key (append-only datasets) that can > benefit from Hudi, since the Hudi ecosystem offers other features such as snapshot > isolation, indexes, clustering, delta streamer, etc., which could be applied > to any dataset without a record key. > > The idea of this proposal is to make both the record key and preCombine key > optional to allow a variety of new use cases on top of Hudi. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-2681) Make hoodie record_key and preCombine_key optional
Vinoth Govindarajan created HUDI-2681: - Summary: Make hoodie record_key and preCombine_key optional Key: HUDI-2681 URL: https://issues.apache.org/jira/browse/HUDI-2681 Project: Apache Hudi Issue Type: New Feature Components: Common Core Reporter: Vinoth Govindarajan At present, Hudi needs a record key and a preCombine key to create a Hudi dataset, which puts a restriction on the kinds of datasets we can create using Hudi. In order to increase the adoption of the Hudi file format across all kinds of derived datasets, similar to Parquet/ORC, we need to offer flexibility to users. I understand that the record key is used for the upsert primitive and we need the preCombine key to break ties and deduplicate, but there are event data and other datasets without any primary key (append-only datasets) that can benefit from Hudi, since the Hudi ecosystem offers other features such as snapshot isolation, indexes, clustering, delta streamer, etc., which could be applied to any dataset without a record key. The idea of this proposal is to make both the record key and preCombine key optional to allow a variety of new use cases on top of Hudi. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2510) QuickStart html page is showing 404
[ https://issues.apache.org/jira/browse/HUDI-2510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan updated HUDI-2510: -- Status: In Progress (was: Open) > QuickStart html page is showing 404 > --- > > Key: HUDI-2510 > URL: https://issues.apache.org/jira/browse/HUDI-2510 > Project: Apache Hudi > Issue Type: Bug >Reporter: Rajesh Mahindra >Assignee: Vinoth Govindarajan >Priority: Major > Labels: pull-request-available > > Some external entities such as GCP are linking to > [https://hudi.apache.org/quickstart.html] for quick start. > > [https://cloud.google.com/blog/products/data-analytics/getting-started-with-new-table-formats-on-dataproc] > > Can we create an alias to the actual quick start link? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2510) QuickStart html page is showing 404
[ https://issues.apache.org/jira/browse/HUDI-2510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan updated HUDI-2510: -- Status: Patch Available (was: In Progress) > QuickStart html page is showing 404 > --- > > Key: HUDI-2510 > URL: https://issues.apache.org/jira/browse/HUDI-2510 > Project: Apache Hudi > Issue Type: Bug >Reporter: Rajesh Mahindra >Assignee: Vinoth Govindarajan >Priority: Major > Labels: pull-request-available > > Some external entities such as GCP are linking to > [https://hudi.apache.org/quickstart.html] for quick start. > > [https://cloud.google.com/blog/products/data-analytics/getting-started-with-new-table-formats-on-dataproc] > > Can we create an alias to the actual quick start link? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-2510) QuickStart html page is showing 404
[ https://issues.apache.org/jira/browse/HUDI-2510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17424790#comment-17424790 ] Vinoth Govindarajan commented on HUDI-2510: --- [~rmahindra] - I have explored the option of adding a redirect in the config file, but for some reason it's not working, so I created a new quickstart page and did the redirection using JS. This is the PR for it; please review: https://github.com/apache/hudi/pull/3753 > QuickStart html page is showing 404 > --- > > Key: HUDI-2510 > URL: https://issues.apache.org/jira/browse/HUDI-2510 > Project: Apache Hudi > Issue Type: Bug >Reporter: Rajesh Mahindra >Assignee: Vinoth Govindarajan >Priority: Major > > Some external entities such as GCP are linking to > [https://hudi.apache.org/quickstart.html] for quick start. > > [https://cloud.google.com/blog/products/data-analytics/getting-started-with-new-table-formats-on-dataproc] > > Can we create an alias to the actual quick start link? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2438) [Umbrella] [RFC-34] Implement BigQuerySyncTool for BigQuery Sync
[ https://issues.apache.org/jira/browse/HUDI-2438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan updated HUDI-2438: -- Description: BigQuery is Google Cloud's fully managed, petabyte-scale, and cost-effective analytics data warehouse that lets you run analytics over vast amounts of data in near real-time. BigQuery currently [doesn’t support|https://cloud.google.com/bigquery/external-data-cloud-storage] Apache Hudi file format, but it has support for the Parquet file format. The proposal is to implement a BigQuerySync similar to HiveSync to sync the Hudi table as the BigQuery External Parquet table so that users can query the Hudi tables using BigQuery. Uber is already syncing some of its Hudi tables to BigQuery data mart this will help them to write, sync, and query. More details are in RFC-34: [https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=188745980] was: BigQuery is Google Cloud's fully managed, petabyte-scale, and cost-effective analytics data warehouse that lets you run analytics over vast amounts of data in near real-time. BigQuery currently [doesn’t support|https://cloud.google.com/bigquery/external-data-cloud-storage] Apache Hudi file format, but it has support for the Parquet file format. The proposal is to implement a BigQuerySync similar to HiveSync to sync the Hudi table as the BigQuery External Parquet table so that users can query the Hudi tables using BigQuery. Uber is already syncing some of its Hudi tables to BigQuery data mart this will help them to write, sync, and query. 
RFC-34: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=188745980 > [Umbrella] [RFC-34] Implement BigQuerySyncTool for BigQuery Sync > > > Key: HUDI-2438 > URL: https://issues.apache.org/jira/browse/HUDI-2438 > Project: Apache Hudi > Issue Type: New Feature > Components: Common Core >Reporter: Vinoth Govindarajan >Assignee: Vinoth Govindarajan >Priority: Major > Labels: BigQuery, Integration > Fix For: 0.10.0 > > > BigQuery is Google Cloud's fully managed, petabyte-scale, and cost-effective > analytics data warehouse that lets you run analytics over vast amounts of > data in near real-time. BigQuery currently [doesn’t > support|https://cloud.google.com/bigquery/external-data-cloud-storage] Apache > Hudi file format, but it has support for the Parquet file format. The > proposal is to implement a BigQuerySync similar to HiveSync to sync the Hudi > table as the BigQuery External Parquet table so that users can query the Hudi > tables using BigQuery. Uber is already syncing some of its Hudi tables to > BigQuery data mart this will help them to write, sync, and query. > > More details are in RFC-34: > [https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=188745980] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2438) [Umbrella] [RFC-34] Implement BigQuerySyncTool for BigQuery Sync
[ https://issues.apache.org/jira/browse/HUDI-2438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan updated HUDI-2438: -- Description: BigQuery is Google Cloud's fully managed, petabyte-scale, and cost-effective analytics data warehouse that lets you run analytics over vast amounts of data in near real-time. BigQuery currently [doesn’t support|https://cloud.google.com/bigquery/external-data-cloud-storage] Apache Hudi file format, but it has support for the Parquet file format. The proposal is to implement a BigQuerySync similar to HiveSync to sync the Hudi table as the BigQuery External Parquet table so that users can query the Hudi tables using BigQuery. Uber is already syncing some of its Hudi tables to BigQuery data mart this will help them to write, sync, and query. RFC-34: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=188745980 was:BigQuery is Google Cloud's fully managed, petabyte-scale, and cost-effective analytics data warehouse that lets you run analytics over vast amounts of data in near real-time. BigQuery currently [doesn’t support|https://cloud.google.com/bigquery/external-data-cloud-storage] Apache Hudi file format, but it has support for the Parquet file format. The proposal is to implement a BigQuerySync similar to HiveSync to sync the Hudi table as the BigQuery External Parquet table so that users can query the Hudi tables using BigQuery. Uber is already syncing some of its Hudi tables to BigQuery data mart this will help them to write, sync, and query. 
> [Umbrella] [RFC-34] Implement BigQuerySyncTool for BigQuery Sync > > > Key: HUDI-2438 > URL: https://issues.apache.org/jira/browse/HUDI-2438 > Project: Apache Hudi > Issue Type: New Feature > Components: Common Core >Reporter: Vinoth Govindarajan >Assignee: Vinoth Govindarajan >Priority: Major > Labels: BigQuery, Integration > Fix For: 0.10.0 > > > BigQuery is Google Cloud's fully managed, petabyte-scale, and cost-effective > analytics data warehouse that lets you run analytics over vast amounts of > data in near real-time. BigQuery currently [doesn’t > support|https://cloud.google.com/bigquery/external-data-cloud-storage] Apache > Hudi file format, but it has support for the Parquet file format. The > proposal is to implement a BigQuerySync similar to HiveSync to sync the Hudi > table as the BigQuery External Parquet table so that users can query the Hudi > tables using BigQuery. Uber is already syncing some of its Hudi tables to > BigQuery data mart this will help them to write, sync, and query. > > RFC-34: > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=188745980 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-2438) [Umbrella] [RFC-34] Implement BigQuerySyncTool for BigQuery Sync
Vinoth Govindarajan created HUDI-2438: - Summary: [Umbrella] [RFC-34] Implement BigQuerySyncTool for BigQuery Sync Key: HUDI-2438 URL: https://issues.apache.org/jira/browse/HUDI-2438 Project: Apache Hudi Issue Type: New Feature Components: Common Core Reporter: Vinoth Govindarajan Assignee: Vinoth Govindarajan Fix For: 0.10.0 BigQuery is Google Cloud's fully managed, petabyte-scale, and cost-effective analytics data warehouse that lets you run analytics over vast amounts of data in near real-time. BigQuery currently [doesn’t support|https://cloud.google.com/bigquery/external-data-cloud-storage] Apache Hudi file format, but it has support for the Parquet file format. The proposal is to implement a BigQuerySync similar to HiveSync to sync the Hudi table as the BigQuery External Parquet table so that users can query the Hudi tables using BigQuery. Uber is already syncing some of its Hudi tables to a BigQuery data mart; this will help them to write, sync, and query. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-2407) [Website] Support staging site per pull request
Vinoth Govindarajan created HUDI-2407: - Summary: [Website] Support staging site per pull request Key: HUDI-2407 URL: https://issues.apache.org/jira/browse/HUDI-2407 Project: Apache Hudi Issue Type: New Feature Components: Docs Reporter: Vinoth Govindarajan Assignee: Vinoth Govindarajan Provide a staging site for ASF website pull requests. Something similar to this Jekyll site PR: https://github.com/apache/hudi/pull/1698 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2357) MERGE INTO doesn't work for tables created using CTAS
[ https://issues.apache.org/jira/browse/HUDI-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan updated HUDI-2357: -- Description: MERGE INTO command doesn't select the correct primary key for tables created using CTAS, whereas it works for tables created using CREATE TABLE command. I guess we are hitting this issue because the key generator class is set to SqlKeyGenerator for tables created using CTAS: working use-case: {code:java} create table h5 (id bigint, name string, ts bigint) using hudi options (type = "cow" , primaryKey="id" , preCombineField="ts" ); merge into h5 as t0 using ( select 5 as s_id, 'vinoth' as s_name, current_timestamp() as s_ts ) t1 on t1.s_id = t0.id when matched then update set * when not matched then insert *; {code} hoodie.properties for working use-case: {code:java} ➜ analytics.db git:(apache_hudi_support) cat h5/.hoodie/hoodie.properties #Properties saved on Wed Aug 25 04:10:33 UTC 2021 #Wed Aug 25 04:10:33 UTC 2021 hoodie.table.name=h5 hoodie.table.recordkey.fields=id hoodie.table.type=COPY_ON_WRITE hoodie.table.precombine.field=ts hoodie.table.partition.fields= hoodie.archivelog.folder=archived hoodie.table.create.schema={"type"\:"record","name"\:"topLevelRecord","fields"\:[{"name"\:"_hoodie_commit_time","type"\:["string","null"]},{"name"\:"_hoodie_commit_seqno","type"\:["string","null"]},{"name"\:"_hoodie_record_key","type"\:["string","null"]},{"name"\:"_hoodie_partition_path","type"\:["string","null"]},{"name"\:"_hoodie_file_name","type"\:["string","null"]},{"name"\:"id","type"\:["long","null"]},{"name"\:"name","type"\:["string","null"]},{"name"\:"ts","type"\:["long","null"]}]} hoodie.timeline.layout.version=1 hoodie.table.version=1{code} Whereas this doesn't work: {code:java} create table h4 using hudi options (type = "cow" , primaryKey="id" , preCombineField="ts" ) as select 5 as id, cast(rand() as string) as name, current_timestamp(); merge into h3 as t0 using (select '5' as s_id, 
'vinoth' as s_name, current_timestamp() as s_ts) t1 on t1.s_id = t0.id when matched then update set * when not matched then insert *; {code} ERROR LOG: 544702 [main] ERROR org.apache.spark.sql.hive.thriftserver.SparkSQLDriver - Failed in [merge into analytics.h3 as t0 using ( select '5' as s_id, 'vinoth' as s_name, current_timestamp() as s_ts) t1 on t1.s_id = t0.id when matched then update set * when not matched then insert *] java.lang.IllegalArgumentException: Merge Key[id] is not Equal to the defined primary key[] in table h3 at org.apache.spark.sql.hudi.command.MergeIntoHoodieTableCommand.buildMergeIntoConfig(MergeIntoHoodieTableCommand.scala:425) at org.apache.spark.sql.hudi.command.MergeIntoHoodieTableCommand.run(MergeIntoHoodieTableCommand.scala:147) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616) at org.apache.spark.sql.Dataset.<init>(Dataset.scala:229) at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97) at 
org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:650) at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:377) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:496) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1$adapted(SparkSQLCLIDriver.scala:490) at scala.collection.Iterator.foreach(Iterator.scala:941) at scala.collection.Iterator.foreach$(Iterator.scala:941) at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) at
[jira] [Created] (HUDI-2357) MERGE INTO doesn't work for tables created using CTAS
Vinoth Govindarajan created HUDI-2357: - Summary: MERGE INTO doesn't work for tables created using CTAS Key: HUDI-2357 URL: https://issues.apache.org/jira/browse/HUDI-2357 Project: Apache Hudi Issue Type: Sub-task Components: Spark Integration Reporter: Vinoth Govindarajan Assignee: pengzhiwei Fix For: 0.9.0 The MERGE INTO command doesn't select the correct primary key for tables created using CTAS, whereas it works for tables created using the CREATE TABLE command. I guess we are hitting this issue because the key generator class is set to SqlKeyGenerator for tables created using CTAS. This works: {code:java} create table h5 (id bigint, name string, ts bigint) using hudi options (type = "cow" , primaryKey="id" , preCombineField="ts" ); merge into h5 as t0 using ( select 5 as s_id, 'vinoth' as s_name, current_timestamp() as s_ts ) t1 on t1.s_id = t0.id when matched then update set * when not matched then insert *; {code} hoodie.properties for the working use-case: {code:java} ➜ analytics.db git:(apache_hudi_support) cat h5/.hoodie/hoodie.properties #Properties saved on Wed Aug 25 04:10:33 UTC 2021 #Wed Aug 25 04:10:33 UTC 2021 hoodie.table.name=h5 hoodie.table.recordkey.fields=id hoodie.table.type=COPY_ON_WRITE hoodie.table.precombine.field=ts hoodie.table.partition.fields= hoodie.archivelog.folder=archived hoodie.table.create.schema={"type"\:"record","name"\:"topLevelRecord","fields"\:[{"name"\:"_hoodie_commit_time","type"\:["string","null"]},{"name"\:"_hoodie_commit_seqno","type"\:["string","null"]},{"name"\:"_hoodie_record_key","type"\:["string","null"]},{"name"\:"_hoodie_partition_path","type"\:["string","null"]},{"name"\:"_hoodie_file_name","type"\:["string","null"]},{"name"\:"id","type"\:["long","null"]},{"name"\:"name","type"\:["string","null"]},{"name"\:"ts","type"\:["long","null"]}]} hoodie.timeline.layout.version=1 hoodie.table.version=1{code} Whereas this doesn't work: {code:java} create table h4 using hudi options (type = "cow" , primaryKey="id" , 
preCombineField="ts" ) as select 5 as id, cast(rand() as string) as name, current_timestamp(); merge into h3 as t0 using ( select '5' as s_id, 'vinoth' as s_name, current_timestamp() as s_ts) t1 on t1.s_id = t0.id when matched then update set * when not matched then insert *; {code} 544702 [main] ERROR org.apache.spark.sql.hive.thriftserver.SparkSQLDriver - Failed in [merge into analytics.h3 as t0 using ( select '5' as s_id, 'vinoth' as s_name, current_timestamp() as s_ts) t1 on t1.s_id = t0.id when matched then update set * when not matched then insert *] java.lang.IllegalArgumentException: Merge Key[id] is not Equal to the defined primary key[] in table h3 at org.apache.spark.sql.hudi.command.MergeIntoHoodieTableCommand.buildMergeIntoConfig(MergeIntoHoodieTableCommand.scala:425) at org.apache.spark.sql.hudi.command.MergeIntoHoodieTableCommand.run(MergeIntoHoodieTableCommand.scala:147) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616) at org.apache.spark.sql.Dataset.<init>(Dataset.scala:229) at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100) at 
org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97) at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:650) at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:377) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:496) at
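The failure above boils down to the merge condition's key not matching the table's recorded primary key, which resolves to empty for the CTAS-created table. A minimal sketch of that validation (illustrative Python, not Hudi's actual Scala code in MergeIntoHoodieTableCommand; the helper name is hypothetical):

```python
def validate_merge_keys(merge_keys, table_primary_keys, table_name):
    # The keys referenced in the merge ON clause must equal the table's
    # configured primary keys; otherwise the merge is rejected up front.
    if sorted(merge_keys) != sorted(table_primary_keys):
        raise ValueError(
            "Merge Key%s is not Equal to the defined primary key%s in table %s"
            % (merge_keys, table_primary_keys, table_name))
    return True

# CREATE TABLE path: primaryKey was persisted, so the merge proceeds.
validate_merge_keys(["id"], ["id"], "h5")

# CTAS path: the primary key list resolves to empty, reproducing the error.
try:
    validate_merge_keys(["id"], [], "h3")
except ValueError as e:
    print(e)  # Merge Key['id'] is not Equal to the defined primary key[] in table h3
```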
[jira] [Updated] (HUDI-2319) Integrate hudi with dbt (data build tool)
[ https://issues.apache.org/jira/browse/HUDI-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan updated HUDI-2319: -- Summary: Integrate hudi with dbt (data build tool) (was: Integrate hudi with dbt) > Integrate hudi with dbt (data build tool) > - > > Key: HUDI-2319 > URL: https://issues.apache.org/jira/browse/HUDI-2319 > Project: Apache Hudi > Issue Type: New Feature > Components: Usability >Reporter: Vinoth Govindarajan >Assignee: Vinoth Govindarajan >Priority: Major > Labels: integration > Fix For: 0.10.0 > > > dbt (data build tool) enables analytics engineers to transform data in their > warehouses by simply writing select statements. dbt handles turning these > select statements into tables and views. > dbt does the {{T}} in {{ELT}} (Extract, Load, Transform) processes – it > doesn’t extract or load data, but it’s extremely good at transforming data > that’s already loaded into your warehouse. > > dbt currently supports only the Delta file format; a few folks in the dbt > community have asked for hudi integration. Moreover, adding dbt integration will > make it easier to create derived datasets for any data engineer/analyst. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2319) Integrate hudi with dbt
[ https://issues.apache.org/jira/browse/HUDI-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan updated HUDI-2319: -- Status: In Progress (was: Open) > Integrate hudi with dbt > --- > > Key: HUDI-2319 > URL: https://issues.apache.org/jira/browse/HUDI-2319 > Project: Apache Hudi > Issue Type: New Feature > Components: Usability >Reporter: Vinoth Govindarajan >Assignee: Vinoth Govindarajan >Priority: Major > Labels: integration > Fix For: 0.10.0 > > > dbt (data build tool) enables analytics engineers to transform data in their > warehouses by simply writing select statements. dbt handles turning these > select statements into tables and views. > dbt does the {{T}} in {{ELT}} (Extract, Load, Transform) processes – it > doesn’t extract or load data, but it’s extremely good at transforming data > that’s already loaded into your warehouse. > > dbt currently supports only the Delta file format; a few folks in the dbt > community have asked for hudi integration. Moreover, adding dbt integration will > make it easier to create derived datasets for any data engineer/analyst. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-2319) Integrate hudi with dbt
Vinoth Govindarajan created HUDI-2319: - Summary: Integrate hudi with dbt Key: HUDI-2319 URL: https://issues.apache.org/jira/browse/HUDI-2319 Project: Apache Hudi Issue Type: New Feature Components: Usability Reporter: Vinoth Govindarajan Assignee: Vinoth Govindarajan Fix For: 0.10.0 dbt (data build tool) enables analytics engineers to transform data in their warehouses by simply writing select statements. dbt handles turning these select statements into tables and views. dbt does the {{T}} in {{ELT}} (Extract, Load, Transform) processes – it doesn’t extract or load data, but it’s extremely good at transforming data that’s already loaded into your warehouse. dbt currently supports only the Delta file format; a few folks in the dbt community have asked for hudi integration. Moreover, adding dbt integration will make it easier to create derived datasets for any data engineer/analyst. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-2275) HoodieDeltaStreamerException when using OCC and a second concurrent writer
[ https://issues.apache.org/jira/browse/HUDI-2275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397526#comment-17397526 ] Vinoth Govindarajan commented on HUDI-2275: --- I can confirm that this is the correct setting to use for backfill pipelines, and it works for our use case. We have set this option for our concurrent backfill pipelines so that they don't change the delta streamer checkpoint key and instead copy over the last checkpoint key into the backfill commit, hence not interrupting the regular incremental query. {code:java} option("hoodie.write.meta.key.prefixes", "deltastreamer.checkpoint.key") {code} > HoodieDeltaStreamerException when using OCC and a second concurrent writer > -- > > Key: HUDI-2275 > URL: https://issues.apache.org/jira/browse/HUDI-2275 > Project: Apache Hudi > Issue Type: Bug > Components: DeltaStreamer, Spark Integration, Writer Core >Affects Versions: 0.9.0 >Reporter: Dave Hagman >Assignee: Sagar Sumit >Priority: Critical > Fix For: 0.10.0 > > > I am trying to utilize [Optimistic Concurrency > Control|https://hudi.apache.org/docs/concurrency_control] in order to allow > two writers to update a single table simultaneously. The two writers are: > * Writer A: Deltastreamer job consuming continuously from Kafka > * Writer B: A spark datasource-based writer that is consuming parquet files > out of S3 > * Table Type: Copy on Write > > After a few commits from each writer the deltastreamer will fail with the > following exception: > > {code:java} > org.apache.hudi.exception.HoodieDeltaStreamerException: Unable to find > previous checkpoint. Please double check if this table was indeed built via > delta streamer. 
Last Commit :Option{val=[20210803165741__commit__COMPLETED]}, > Instants :[[20210803165741__commit__COMPLETED]], CommitMetadata={ > "partitionToWriteStats" : { > ...{code} > > What appears to be happening is a lack of commit isolation between the two > writers > Writer B (spark datasource writer) will land commits which are eventually > picked up by Writer A (Delta Streamer). This is an issue because the Delta > Streamer needs checkpoint information which the spark datasource of course > does not include in its commits. My understanding was that OCC was built for > this very purpose (among others). > OCC config for Delta Streamer: > {code:java} > hoodie.write.concurrency.mode=optimistic_concurrency_control > hoodie.cleaner.policy.failed.writes=LAZY > > hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider > hoodie.write.lock.zookeeper.url= > hoodie.write.lock.zookeeper.port=2181 > hoodie.write.lock.zookeeper.lock_key=writer_lock > hoodie.write.lock.zookeeper.base_path=/hudi-write-locks{code} > > OCC config for spark datasource: > {code:java} > // Multi-writer concurrency > .option("hoodie.cleaner.policy.failed.writes", "LAZY") > .option("hoodie.write.concurrency.mode", "optimistic_concurrency_control") > .option( > "hoodie.write.lock.provider", > > org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider.class.getCanonicalName() > ) > .option("hoodie.write.lock.zookeeper.url", jobArgs.zookeeperHost) > .option("hoodie.write.lock.zookeeper.port", jobArgs.zookeeperPort) > .option("hoodie.write.lock.zookeeper.lock_key", "writer_lock") > .option("hoodie.write.lock.zookeeper.base_path", "/hudi-write-locks"){code} > h3. 
Steps to Reproduce: > * Start a deltastreamer job against some table Foo > * In parallel, start writing to the same table Foo using spark datasource > writer > * Note that after a few commits from each the deltastreamer is likely to > fail with the above exception when the datasource writer creates non-isolated > inflight commits > NOTE: I have not tested this with two of the same datasources (ex. two > deltastreamer jobs) > NOTE 2: Another detail that may be relevant is that the two writers are on > completely different spark clusters but I assumed this shouldn't be an issue > since we're locking using Zookeeper -- This message was sent by Atlassian Jira (v8.3.4#803005)
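The workaround in the comment above ({{hoodie.write.meta.key.prefixes}}) works by carrying checkpoint metadata forward across commits from other writers. A rough sketch of that behavior, assuming commit metadata is a flat key-value map (illustrative Python, not Hudi's implementation):

```python
def carry_over_checkpoint(prev_commit_meta, new_commit_meta,
                          key_prefixes=("deltastreamer.checkpoint.key",)):
    # For every configured prefix, copy matching extra-metadata entries
    # from the latest completed commit into the new (backfill) commit,
    # so the Delta Streamer still finds its previous checkpoint.
    for key, value in prev_commit_meta.items():
        if any(key.startswith(p) for p in key_prefixes):
            new_commit_meta.setdefault(key, value)
    return new_commit_meta

# The Delta Streamer's last commit carries its checkpoint...
prev = {"deltastreamer.checkpoint.key": "topic1,0:1234", "schema": "..."}
# ...and the concurrent backfill writer's commit preserves it
# instead of dropping it and breaking the next incremental pull.
backfill = carry_over_checkpoint(prev, {})
```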
[jira] [Updated] (HUDI-1985) Website re-design implementation
[ https://issues.apache.org/jira/browse/HUDI-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan updated HUDI-1985: -- Status: In Progress (was: Open) > Website re-design implementation > > > Key: HUDI-1985 > URL: https://issues.apache.org/jira/browse/HUDI-1985 > Project: Apache Hudi > Issue Type: Improvement > Components: Docs >Reporter: Raymond Xu >Assignee: Vinoth Govindarajan >Priority: Blocker > Labels: documentation, pull-request-available > Fix For: 0.9.0 > > > To provide better navigation and organization of Hudi website's info, we have > done a re-design of the web pages. > Previous discussion > [https://github.com/apache/hudi/issues/2905] > > See the wireframe and final design in > [https://www.figma.com/file/tipod1JZRw7anZRWBI6sZh/Hudi.Apache?node-id=32%3A6] > (login Figma to comment) > The design is ready for implementation. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-1985) Website re-design implementation
[ https://issues.apache.org/jira/browse/HUDI-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan reassigned HUDI-1985: - Assignee: Vinoth Govindarajan > Website re-design implementation > > > Key: HUDI-1985 > URL: https://issues.apache.org/jira/browse/HUDI-1985 > Project: Apache Hudi > Issue Type: Improvement > Components: Docs >Reporter: Raymond Xu >Assignee: Vinoth Govindarajan >Priority: Blocker > Labels: documentation > Fix For: 0.9.0 > > > To provide better navigation and organization of Hudi website's info, we have > done a re-design of the web pages. > Previous discussion > [https://github.com/apache/hudi/issues/2905] > > See the wireframe and final design in > [https://www.figma.com/file/tipod1JZRw7anZRWBI6sZh/Hudi.Apache?node-id=32%3A6] > (login Figma to comment) > The design is ready for implementation. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1985) Website re-design implementation
[ https://issues.apache.org/jira/browse/HUDI-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17379583#comment-17379583 ] Vinoth Govindarajan commented on HUDI-1985: --- Hi [~xushiyan], I have experience in the past building websites, I can volunteer to work on this re-design. > Website re-design implementation > > > Key: HUDI-1985 > URL: https://issues.apache.org/jira/browse/HUDI-1985 > Project: Apache Hudi > Issue Type: Improvement > Components: Docs >Reporter: Raymond Xu >Priority: Blocker > Labels: documentation > Fix For: 0.9.0 > > > To provide better navigation and organization of Hudi website's info, we have > done a re-design of the web pages. > Previous discussion > [https://github.com/apache/hudi/issues/2905] > > See the wireframe and final design in > [https://www.figma.com/file/tipod1JZRw7anZRWBI6sZh/Hudi.Apache?node-id=32%3A6] > (login Figma to comment) > The design is ready for implementation. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1117) Add tdunning json library to spark and utilities bundle
[ https://issues.apache.org/jira/browse/HUDI-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17378256#comment-17378256 ] Vinoth Govindarajan commented on HUDI-1117: --- Even after adding the JSON jar to the classpath, it didn't resolve the issue. > Add tdunning json library to spark and utilities bundle > --- > > Key: HUDI-1117 > URL: https://issues.apache.org/jira/browse/HUDI-1117 > Project: Apache Hudi > Issue Type: Task > Components: Spark Integration >Affects Versions: 0.9.0 >Reporter: Balaji Varadarajan >Assignee: Balaji Varadarajan >Priority: Major > Labels: sev:high, user-support-issues > Fix For: 0.9.0 > > > Exception during Hive Sync: > ``` > An error occurred while calling o175.save.\n: java.lang.NoClassDefFoundError: > org/json/JSONException\n\tat > org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeCreateTable(SemanticAnalyzer.java:10847)\n\tat > > org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genResolvedParseTree(SemanticAnalyzer.java:10047)\n\tat > > org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10128)\n\tat > > org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:209)\n\tat > > org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:227)\n\tat > org.apache.hadoop.hive.ql.Driver.compile(Driver.java:424)\n\tat > org.apache.hadoop.hive.ql.Driver.compile(Driver.java:308)\n\tat > org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1122)\n\tat > org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1170)\n\tat > org.apache.hadoop.hive.ql.Driver.run(Driver.java:1059)\n\tat > org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)\n\tat > org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLs(HoodieHiveClient.java:515)\n\tat > > org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLUsingHiveDriver(HoodieHiveClient.java:498)\n\tat > > org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:488)\n\tat 
> > org.apache.hudi.hive.HoodieHiveClient.createTable(HoodieHiveClient.java:273)\n\tat > org.apache.hudi.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:146)\n\tat > ``` > This is from using hudi-spark-bundle. > [https://github.com/apache/hudi/issues/1787] > JSONException class is coming from > https://mvnrepository.com/artifact/org.json/json There is licensing issue and > hence not part of hudi bundle packages. The underlying issue is due to Hive > 1.x vs 2.x ( See > https://issues.apache.org/jira/browse/HUDI-150?jql=text%20~%20%22org.json%22%20and%20project%20%3D%20%22Apache%20Hudi%22%20) > Spark Hive integration still brings in hive 1.x jars which depends on > org.json. I believe this was provided in user's environment and hence we have > not seen folks complaining about this issue. > Even though this is not Hudi issue per se, let me check a jar with compatible > license : https://mvnrepository.com/artifact/com.tdunning/json/1.8 and if it > works, we will add to 0.6 bundles after discussing with community. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1743) Add support for Spark SQL File based transformer for deltastreamer
[ https://issues.apache.org/jira/browse/HUDI-1743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan updated HUDI-1743: -- Status: Closed (was: Patch Available) > Add support for Spark SQL File based transformer for deltastreamer > -- > > Key: HUDI-1743 > URL: https://issues.apache.org/jira/browse/HUDI-1743 > Project: Apache Hudi > Issue Type: Improvement > Components: DeltaStreamer >Reporter: Vinoth Govindarajan >Assignee: Vinoth Govindarajan >Priority: Minor > Labels: features, pull-request-available, sev:normal > > The current SQLQuery-based transformer is limited in functionality: you can't > pass multiple Spark SQL statements separated by semicolons, which is > necessary if your transformation is complex. > > The ask is to add a new SQLFileBasedTransformer which takes a Spark SQL file > as input with multiple Spark SQL statements and applies the transformation to > the delta streamer payload. -- This message was sent by Atlassian Jira (v8.3.4#803005)
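The transformer asked for above can be thought of as: split the file into statements, run each in order, and use the last statement's result as the transformed payload. A simplified sketch under that assumption (illustrative Python; `run_statement` stands in for `spark.sql`, and the exact splitting rules of the real transformer may differ):

```python
def run_sql_file(run_statement, sql_text):
    # Split the file on semicolons and execute each non-empty statement
    # in order; the last statement's result is the transformed payload.
    result = None
    for stmt in sql_text.split(";"):
        stmt = stmt.strip()
        if stmt:
            result = run_statement(stmt)
    return result

# Toy "engine" that records statements, to show the execution order.
executed = []
run_sql_file(lambda s: executed.append(s) or s,
             "CACHE TABLE tmp AS SELECT * FROM source_tbl;\n"
             "SELECT * FROM tmp WHERE ds > '2021-01-01'")
# executed now holds both statements, in file order.
```

Note that naive semicolon splitting breaks on semicolons inside string literals; a production implementation would need a slightly smarter tokenizer.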
[jira] [Updated] (HUDI-1790) Add SqlSource for DeltaStreamer to support backfill use cases
[ https://issues.apache.org/jira/browse/HUDI-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan updated HUDI-1790: -- Status: Patch Available (was: In Progress) > Add SqlSource for DeltaStreamer to support backfill use cases > - > > Key: HUDI-1790 > URL: https://issues.apache.org/jira/browse/HUDI-1790 > Project: Apache Hudi > Issue Type: New Feature > Components: DeltaStreamer >Reporter: Vinoth Govindarajan >Assignee: Vinoth Govindarajan >Priority: Major > Labels: pull-request-available > > Delta Streamer is great for incremental workloads, but we need to support > backfills for use cases like adding a new column and backfilling only that > column for the last 6 months, or reprocessing a couple of older partitions > after a bug in our transformation logic. > > If we have a SqlSource as one of the input sources to the delta streamer, then > I can pass any custom Spark SQL query selecting specific partitions to > backfill. > > When we do the backfill, we shouldn't update the last processed commit > checkpoint; instead, the backfill commit should carry over the last processed > checkpoint from before the backfill. > > cc [~nishith29] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1747) Deltastreamer incremental read is not working on the MOR table
[ https://issues.apache.org/jira/browse/HUDI-1747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17325466#comment-17325466 ] Vinoth Govindarajan commented on HUDI-1747: --- [~shivnarayan] - Here are the answers: * Do you see the exception during first time itself when trying to incrementally read from hudi MOR table via delta streamer? or after few incremental pulls via deltastreamer. ** I see it the first time itself. * I assume the schema given in the stack trace is the source table schema of deltastreamer. Can you confirm that. ** It's a big table, the schema is just the first few fields, not all the fields are shown in the log. * I need to set aside sometime to try and reproduce this. but if you have steps to reproduce w/ configs, would be awesome. but if it might consume too much time for you, nvm. ** I've a test pipeline which you can use for debugging, I will share the details in slack. > Deltastreamer incremental read is not working on the MOR table > -- > > Key: HUDI-1747 > URL: https://issues.apache.org/jira/browse/HUDI-1747 > Project: Apache Hudi > Issue Type: Bug > Components: Common Core >Reporter: Vinoth Govindarajan >Priority: Critical > Labels: sev:critical > > I was trying to read the MOR HUDI table incrementally using delta streamer, > while doing that I ran into this issue where it says: > {code:java} > Found recursive reference in Avro schema, which can not be processed by > Spark:{code} > Spark Version: 2.4 > Hudi Version: 0.7.0-SNAPSHOT or the latest master > > Full Stack Trace: > {code:java} > Found recursive reference in Avro schema, which can not be processed by Spark: > { > "type" : "record", > "name" : "meta", > "fields" : [ { > "name" : "verified", > "type" : [ "null", "boolean" ], > "default" : null > }, { > "name" : "zip", > "type" : [ "null", "string" ], > "default" : null > }, { > "name" : "lname", > "type" : [ "null", "string" ], > "default" : null > }] > } > > at > 
org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:75) > at > org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:105) > at > org.apache.spark.sql.avro.SchemaConverters$$anonfun$1.apply(SchemaConverters.scala:82) > at > org.apache.spark.sql.avro.SchemaConverters$$anonfun$1.apply(SchemaConverters.scala:81) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:891) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:81) > at > org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:105) > at > org.apache.spark.sql.avro.SchemaConverters$$anonfun$1.apply(SchemaConverters.scala:82) > at > org.apache.spark.sql.avro.SchemaConverters$$anonfun$1.apply(SchemaConverters.scala:81) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:891) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > 
org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:81) > at > org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:95) > at > org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:105) > at > org.apache.spark.sql.avro.SchemaConverters$$anonfun$1.apply(SchemaConverters.scala:82) > at > org.apache.spark.sql.avro.SchemaConverters$$anonfun$1.apply(SchemaConverters.scala:81) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at >
[jira] [Updated] (HUDI-1790) Add SqlSource for DeltaStreamer to support backfill use cases
[ https://issues.apache.org/jira/browse/HUDI-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan updated HUDI-1790: -- Status: In Progress (was: Open) > Add SqlSource for DeltaStreamer to support backfill use cases > - > > Key: HUDI-1790 > URL: https://issues.apache.org/jira/browse/HUDI-1790 > Project: Apache Hudi > Issue Type: New Feature > Components: DeltaStreamer >Reporter: Vinoth Govindarajan >Assignee: Vinoth Govindarajan >Priority: Major > > Delta Streamer is great for incremental workloads, but we also need to support > backfills, for use cases such as adding a new column and backfilling only that > column for the last 6 months, or reprocessing a couple of older partitions > after a bug in the transformation logic. > > If we add a SqlSource as one of the input sources to the delta streamer, users > can pass custom Spark SQL queries that select the specific partitions to > backfill. > > A backfill should not update the last processed commit checkpoint; instead, the > checkpoint from before the backfill has to be copied over to the backfill > commit. > > cc [~nishith29] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1790) Add SqlSource for DeltaStreamer to support backfill use cases
Vinoth Govindarajan created HUDI-1790: - Summary: Add SqlSource for DeltaStreamer to support backfill use cases Key: HUDI-1790 URL: https://issues.apache.org/jira/browse/HUDI-1790 Project: Apache Hudi Issue Type: New Feature Components: DeltaStreamer Reporter: Vinoth Govindarajan Assignee: Vinoth Govindarajan -- This message was sent by Atlassian Jira (v8.3.4#803005)
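The checkpoint rule in HUDI-1790 (a backfill commit should carry the previous checkpoint forward rather than advance it) can be sketched as follows. The metadata key and class names below are hypothetical, loosely modeled on how deltastreamer keeps its checkpoint in commit metadata; this is an illustrative sketch, not the actual implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: build the commit metadata for a backfill commit by
// copying the checkpoint of the last regular commit unchanged, so the next
// incremental run resumes exactly where the regular pipeline left off.
public class BackfillCheckpointSketch {
    // hypothetical key name; the real checkpoint lives in deltastreamer's
    // commit extraMetadata
    public static final String CHECKPOINT_KEY = "deltastreamer.checkpoint.key";

    public static Map<String, String> backfillCommitMetadata(Map<String, String> lastCommitMetadata) {
        Map<String, String> backfillMetadata = new HashMap<>();
        // carry the checkpoint forward instead of advancing it
        backfillMetadata.put(CHECKPOINT_KEY, lastCommitMetadata.get(CHECKPOINT_KEY));
        return backfillMetadata;
    }
}
```

The point of the design is that a backfill is invisible to checkpointing: subsequent incremental runs behave as if the backfill never happened.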
[jira] [Updated] (HUDI-1743) Add support for Spark SQL File based transformer for deltastreamer
[ https://issues.apache.org/jira/browse/HUDI-1743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan updated HUDI-1743: -- Status: Patch Available (was: In Progress) > Add support for Spark SQL File based transformer for deltastreamer > -- > > Key: HUDI-1743 > URL: https://issues.apache.org/jira/browse/HUDI-1743 > Project: Apache Hudi > Issue Type: Improvement > Components: DeltaStreamer >Reporter: Vinoth Govindarajan >Assignee: Vinoth Govindarajan >Priority: Minor > Labels: features, pull-request-available > > The current SQLQuery-based transformer is limited in functionality: you can't > pass multiple Spark SQL statements separated by semicolons, which is > necessary when a transformation is complex. > > The ask is to add a new SQLFileBasedTransformer that takes as input a Spark > SQL file containing multiple statements and applies the transformations to > the delta streamer payload. -- This message was sent by Atlassian Jira (v8.3.4#803005)
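The core of the proposed transformer can be sketched as below. SqlFileStatementSplitter and splitStatements are hypothetical names, and the Spark wiring (reading the file, running each statement through sparkSession.sql, with the last statement producing the transformed dataset) is omitted so the sketch stays self-contained.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split the contents of a SQL file into individual
// statements on semicolons, so each can be executed in order.
public class SqlFileStatementSplitter {

    public static List<String> splitStatements(String sqlFileContents) {
        List<String> statements = new ArrayList<>();
        // naive split; semicolons inside string literals are not handled
        for (String stmt : sqlFileContents.split(";")) {
            String trimmed = stmt.trim();
            if (!trimmed.isEmpty()) {
                statements.add(trimmed);
            }
        }
        return statements;
    }
}
```

A real implementation would also need to decide how comments and semicolons embedded in string literals are treated; the naive split above is only the starting point.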
[jira] [Updated] (HUDI-1743) Add support for Spark SQL File based transformer for deltastreamer
[ https://issues.apache.org/jira/browse/HUDI-1743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan updated HUDI-1743: -- Status: In Progress (was: Open) > Add support for Spark SQL File based transformer for deltastreamer > -- > > Key: HUDI-1743 > URL: https://issues.apache.org/jira/browse/HUDI-1743 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1762) Hive Sync is not working with Hive Style Partitioning
[ https://issues.apache.org/jira/browse/HUDI-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317655#comment-17317655 ] Vinoth Govindarajan commented on HUDI-1762: --- PR merged. > Hive Sync is not working with Hive Style Partitioning > - > > Key: HUDI-1762 > URL: https://issues.apache.org/jira/browse/HUDI-1762 > Project: Apache Hudi > Issue Type: Bug > Components: Hive Integration >Reporter: Vinoth Govindarajan >Assignee: Vinoth Govindarajan >Priority: Major > Labels: hive, pull-request-available > > When you create a Hudi table with hive-style partitioning and enable hive > sync, the sync fails because the default partition value extractor assumes > partition values are separated by slashes. > > Hive-style partitioning is enabled for the target table like this: > {code:java} > hoodie.datasource.write.partitionpath.field=datestr > hoodie.datasource.write.hive_style_partitioning=true > {code} > This is the error it throws: > {code:java} > 21/04/01 23:10:33 ERROR deltastreamer.HoodieDeltaStreamer: Got error running > delta sync once. 
Shutting down > org.apache.hudi.exception.HoodieException: Got runtime exception when hive > syncing delta_streamer_test > at > org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:122) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.syncMeta(DeltaSync.java:560) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:475) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:282) > at > org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:170) > at org.apache.hudi.common.util.Option.ifPresent(Option.java:96) > at > org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:168) > at > org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:470) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:690) > Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Failed to sync > partitions for table fact_scheduled_trip__1pc_trip_uuid > at > org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:229) > at > org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:166) > at > org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:108) > ... 
12 more > Caused by: java.lang.IllegalArgumentException: Partition path > datestr=2021-03-28 is not in the form yyyy/mm/dd > at > org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor.extractPartitionValuesInPath(SlashEncodedDayPartitionValueExtractor.java:55) > at > org.apache.hudi.hive.HoodieHiveClient.getPartitionEvents(HoodieHiveClient.java:220) > at > org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:221) > ... 14 more > {code} > To fix this issue, we need to create a new partition extractor class and > register it as the hive sync partition extractor. > Once the new partition extractor class is defined, you can configure it like > this: > {code:java} > hoodie.datasource.hive_sync.enable=true > hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
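A minimal sketch of such an extractor's core logic, assuming hive-style paths of the form key=value (e.g. datestr=2021-03-28). This illustrates the idea behind HiveStylePartitionValueExtractor but is not the merged implementation; the hudi-hive-sync interface wiring is omitted so it stays self-contained.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: extract the partition value from a hive-style
// partition path such as "datestr=2021-03-28", instead of assuming
// slash-separated yyyy/mm/dd components.
public class HiveStylePartitionExtractorSketch {

    public static List<String> extractPartitionValuesInPath(String partitionPath) {
        // hive-style paths encode a single partition level as key=value
        String[] parts = partitionPath.split("=", 2);
        if (parts.length != 2) {
            throw new IllegalArgumentException(
                "Partition path " + partitionPath + " is not in hive-style key=value form");
        }
        return Collections.singletonList(parts[1]);
    }
}
```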
[jira] [Updated] (HUDI-1762) Hive Sync is not working with Hive Style Partitioning
[ https://issues.apache.org/jira/browse/HUDI-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan updated HUDI-1762: -- Status: Closed (was: Patch Available) > Hive Sync is not working with Hive Style Partitioning > - > > Key: HUDI-1762 > URL: https://issues.apache.org/jira/browse/HUDI-1762 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1770) Deltastreamer throws errors when not running frequently
[ https://issues.apache.org/jira/browse/HUDI-1770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan updated HUDI-1770: -- Affects Version/s: 0.7.0 > Deltastreamer throws errors when not running frequently > --- > > Key: HUDI-1770 > URL: https://issues.apache.org/jira/browse/HUDI-1770 > Project: Apache Hudi > Issue Type: Bug > Components: DeltaStreamer >Affects Versions: 0.7.0, 0.8.0 >Reporter: Vinoth Govindarajan >Priority: Major > > When delta streamer is using HoodieIncrSource from another parent Hudi table, > it runs into this error when the delta streamer pipeline is not run frequently > (presumably because files referenced by older commits have since been removed > on the parent table). > > {code:java} > User class threw exception: org.apache.spark.sql.AnalysisException: Path does > not exist: > hdfs:///tmp/delta_streamer_test/datestr=2021-03-30/f64f3420-4e03-4835-ab06-5d73cb953aa9-0_3-4-91_20210402163524.parquet; > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:558) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:545) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.immutable.List.foreach(List.scala:392) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.immutable.List.flatMap(List.scala:355) > at > org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:545) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:359) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223) > at 
org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211) > at > org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:641) > at > org.apache.hudi.IncrementalRelation.buildScan(IncrementalRelation.scala:151) > at > org.apache.spark.sql.execution.datasources.DataSourceStrategy.apply(DataSourceStrategy.scala:306) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:78) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:75) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at scala.collection.Iterator$class.foreach(Iterator.scala:891) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) > at > scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157) > at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1334) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:75) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:67) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93) > at > 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:78) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:75) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at scala.collection.Iterator$class.foreach(Iterator.scala:891) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) > at >
[jira] [Updated] (HUDI-1770) Deltastreamer throws errors when not running frequently
[ https://issues.apache.org/jira/browse/HUDI-1770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan updated HUDI-1770: -- Affects Version/s: 0.8.0 > Deltastreamer throws errors when not running frequently > --- > > Key: HUDI-1770 > URL: https://issues.apache.org/jira/browse/HUDI-1770
[jira] [Updated] (HUDI-1770) Deltastreamer throws errors when not running frequently
[ https://issues.apache.org/jira/browse/HUDI-1770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan updated HUDI-1770: -- Description: When delta streamer is using HoodieIncrSource from another parent Hudi table, it runs into this error, when you are not running your delta streamer pipeline frequently. {code:java} User class threw exception: org.apache.spark.sql.AnalysisException: Path does not exist: hdfs:///tmp/delta_streamer_test/datestr=2021-03-30/f64f3420-4e03-4835-ab06-5d73cb953aa9-0_3-4-91_20210402163524.parquet; {code}
[jira] [Created] (HUDI-1770) Deltastreamer throws errors when not running frequently
Vinoth Govindarajan created HUDI-1770: - Summary: Deltastreamer throws errors when not running frequently Key: HUDI-1770 URL: https://issues.apache.org/jira/browse/HUDI-1770 Project: Apache Hudi Issue Type: Bug Components: DeltaStreamer Reporter: Vinoth Govindarajan When delta streamer is using HoodieIncrSource from another parent Hudi table, it runs into this error: {code:java} User class threw exception: org.apache.spark.sql.AnalysisException: Path does not exist: hdfs:///tmp/delta_streamer_test/datestr=2021-03-30/f64f3420-4e03-4835-ab06-5d73cb953aa9-0_3-4-91_20210402163524.parquet; {code}
[jira] [Updated] (HUDI-1762) Hive Sync is not working with Hive Style Partitioning
[ https://issues.apache.org/jira/browse/HUDI-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan updated HUDI-1762: -- Status: Patch Available (was: In Progress) > Hive Sync is not working with Hive Style Partitioning > - > > Key: HUDI-1762 > URL: https://issues.apache.org/jira/browse/HUDI-1762 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1762) Hive Sync is not working with Hive Style Partitioning
[ https://issues.apache.org/jira/browse/HUDI-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan updated HUDI-1762: -- Status: In Progress (was: Open) > Hive Sync is not working with Hive Style Partitioning > - > > Key: HUDI-1762 > URL: https://issues.apache.org/jira/browse/HUDI-1762 > Project: Apache Hudi > Issue Type: Bug > Components: Hive Integration >Reporter: Vinoth Govindarajan >Assignee: Vinoth Govindarajan >Priority: Major > Labels: hive, pull-request-available > > When you create a Hudi table with hive style partitioning and enable hive > sync, it doesn't work because the sync assumes partition values are > separated by slashes. > > When hive style partitioning is enabled for the target table like this: > {code:java} > hoodie.datasource.write.partitionpath.field=datestr > hoodie.datasource.write.hive_style_partitioning=true > {code} > This is the error it throws: > {code:java} > 21/04/01 23:10:33 ERROR deltastreamer.HoodieDeltaStreamer: Got error running > delta sync once. 
Shutting down > org.apache.hudi.exception.HoodieException: Got runtime exception when hive > syncing delta_streamer_test > at > org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:122) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.syncMeta(DeltaSync.java:560) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:475) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:282) > at > org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:170) > at org.apache.hudi.common.util.Option.ifPresent(Option.java:96) > at > org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:168) > at > org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:470) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:690) > Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Failed to sync > partitions for table fact_scheduled_trip__1pc_trip_uuid > at > org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:229) > at > org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:166) > at > org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:108) > ... 
12 more > Caused by: java.lang.IllegalArgumentException: Partition path > datestr=2021-03-28 is not in the form /mm/dd > at > org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor.extractPartitionValuesInPath(SlashEncodedDayPartitionValueExtractor.java:55) > at > org.apache.hudi.hive.HoodieHiveClient.getPartitionEvents(HoodieHiveClient.java:220) > at > org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:221) > ... 14 more > {code} > To fix this issue we need to create a new partition extractor class and > assign that class name as the hive sync partition extractor. > After you define the new partition extractor class, you can configure it like > this: > {code:java} > hoodie.datasource.hive_sync.enable=true > hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1762) Hive Sync is not working with Hive Style Partitioning
Vinoth Govindarajan created HUDI-1762: - Summary: Hive Sync is not working with Hive Style Partitioning Key: HUDI-1762 URL: https://issues.apache.org/jira/browse/HUDI-1762 Project: Apache Hudi Issue Type: Bug Components: Hive Integration Reporter: Vinoth Govindarajan Assignee: Vinoth Govindarajan When you create a Hudi table with hive style partitioning and enable hive sync, it doesn't work because the sync assumes partition values are separated by slashes. When hive style partitioning is enabled for the target table like this: {code:java} hoodie.datasource.write.partitionpath.field=datestr hoodie.datasource.write.hive_style_partitioning=true {code} This is the error it throws: {code:java} 21/04/01 23:10:33 ERROR deltastreamer.HoodieDeltaStreamer: Got error running delta sync once. Shutting down org.apache.hudi.exception.HoodieException: Got runtime exception when hive syncing delta_streamer_test at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:122) at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncMeta(DeltaSync.java:560) at org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:475) at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:282) at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:170) at org.apache.hudi.common.util.Option.ifPresent(Option.java:96) at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:168) at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:470) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:690) Caused by: 
org.apache.hudi.hive.HoodieHiveSyncException: Failed to sync partitions for table fact_scheduled_trip__1pc_trip_uuid at org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:229) at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:166) at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:108) ... 12 more Caused by: java.lang.IllegalArgumentException: Partition path datestr=2021-03-28 is not in the form /mm/dd at org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor.extractPartitionValuesInPath(SlashEncodedDayPartitionValueExtractor.java:55) at org.apache.hudi.hive.HoodieHiveClient.getPartitionEvents(HoodieHiveClient.java:220) at org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:221) ... 14 more {code} To fix this issue, we need to create a new partition extractor class and configure that class as the hive sync partition extractor. After you define the new partition extractor class, you can configure it like this: {code:java} hoodie.datasource.hive_sync.enable=true hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
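For illustration, the parsing such an extractor has to do is small. Below is a hedged Python sketch of the idea only (the actual fix would be a Java `PartitionValueExtractor` implementation inside hudi-hive-sync; the function name here is made up):

```python
# Sketch of the parsing a hive-style partition value extractor performs.
# A path like "datestr=2021-03-28" (or "datestr=2021-03-28/country=US" for
# multi-level partitioning) yields the list of partition values.

def extract_partition_values(partition_path: str) -> list:
    """Parse a hive-style partition path into its partition values."""
    values = []
    for segment in partition_path.strip("/").split("/"):
        # Each hive-style segment must be "<field>=<value>".
        if "=" not in segment:
            raise ValueError(
                f"Segment '{segment}' is not in hive-style <field>=<value> form")
        _, value = segment.split("=", 1)
        values.append(value)
    return values

print(extract_partition_values("datestr=2021-03-28"))  # ['2021-03-28']
```

Note this never assumes the slash-separated `yyyy/mm/dd` layout that `SlashEncodedDayPartitionValueExtractor` expects, which is exactly why a separate extractor class is needed.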
[jira] [Created] (HUDI-1747) Deltastreamer incremental read is not working on the MOR table
Vinoth Govindarajan created HUDI-1747: - Summary: Deltastreamer incremental read is not working on the MOR table Key: HUDI-1747 URL: https://issues.apache.org/jira/browse/HUDI-1747 Project: Apache Hudi Issue Type: Bug Components: Common Core Reporter: Vinoth Govindarajan I was trying to read the MOR HUDI table incrementally using delta streamer, while doing that I ran into this issue where it says: {code:java} Found recursive reference in Avro schema, which can not be processed by Spark:{code} Spark Version: 2.4 Hudi Version: 0.7.0-SNAPSHOT or the latest master Full Stack Trace: {code:java} Found recursive reference in Avro schema, which can not be processed by Spark: { "type" : "record", "name" : "meta", "fields" : [ { "name" : "verified", "type" : [ "null", "boolean" ], "default" : null }, { "name" : "zip", "type" : [ "null", "string" ], "default" : null }, { "name" : "lname", "type" : [ "null", "string" ], "default" : null }] } at org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:75) at org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:105) at org.apache.spark.sql.avro.SchemaConverters$$anonfun$1.apply(SchemaConverters.scala:82) at org.apache.spark.sql.avro.SchemaConverters$$anonfun$1.apply(SchemaConverters.scala:81) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.Iterator$class.foreach(Iterator.scala:891) at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at scala.collection.AbstractIterable.foreach(Iterable.scala:54) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:81) at 
org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:105) at org.apache.spark.sql.avro.SchemaConverters$$anonfun$1.apply(SchemaConverters.scala:82) at org.apache.spark.sql.avro.SchemaConverters$$anonfun$1.apply(SchemaConverters.scala:81) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.Iterator$class.foreach(Iterator.scala:891) at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at scala.collection.AbstractIterable.foreach(Iterable.scala:54) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:81) at org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:95) at org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:105) at org.apache.spark.sql.avro.SchemaConverters$$anonfun$1.apply(SchemaConverters.scala:82) at org.apache.spark.sql.avro.SchemaConverters$$anonfun$1.apply(SchemaConverters.scala:81) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.Iterator$class.foreach(Iterator.scala:891) at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at scala.collection.AbstractIterable.foreach(Iterable.scala:54) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:81) at 
org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:105) at org.apache.spark.sql.avro.SchemaConverters$$anonfun$1.apply(SchemaConverters.scala:82) at org.apache.spark.sql.avro.SchemaConverters$$anonfun$1.apply(SchemaConverters.scala:81) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at
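The failure above comes from Spark's `SchemaConverters` walking the Avro schema and aborting when it meets a record type already seen on the current path (note the quoted `meta` record is not itself recursive, so this can surface as a false positive when a named type is reused). As a hedged, standalone Python sketch of that detection logic (an illustration of the check, not Spark's actual Scala implementation), operating on an Avro schema parsed from JSON:

```python
def find_recursive_reference(schema, seen=frozenset()):
    """Return the name of a record type that references itself on the
    current path, or None. `schema` is an Avro schema as parsed JSON
    (dict / list / str)."""
    if isinstance(schema, str):
        # A bare string is either a primitive or a reference to a named type.
        return schema if schema in seen else None
    if isinstance(schema, list):  # union, e.g. ["null", "string"]
        for branch in schema:
            hit = find_recursive_reference(branch, seen)
            if hit:
                return hit
        return None
    if isinstance(schema, dict):
        t = schema.get("type")
        if t == "record":
            seen = seen | {schema["name"]}  # this record is now on the path
            for field in schema.get("fields", []):
                hit = find_recursive_reference(field["type"], seen)
                if hit:
                    return hit
            return None
        if t == "array":
            return find_recursive_reference(schema["items"], seen)
        if t == "map":
            return find_recursive_reference(schema["values"], seen)
        return find_recursive_reference(t, seen)  # wrapped/nested type
    return None

# A truly self-referential record is detected:
linked = {"type": "record", "name": "node",
          "fields": [{"name": "next", "type": ["null", "node"]}]}
print(find_recursive_reference(linked))  # node
```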
[jira] [Created] (HUDI-1743) Add support for Spark SQL File based transformer for deltastreamer
Vinoth Govindarajan created HUDI-1743: - Summary: Add support for Spark SQL File based transformer for deltastreamer Key: HUDI-1743 URL: https://issues.apache.org/jira/browse/HUDI-1743 Project: Apache Hudi Issue Type: Improvement Components: DeltaStreamer Reporter: Vinoth Govindarajan Assignee: Vinoth Govindarajan The current SQLQuery-based transformer is limited in functionality: you can't pass multiple Spark SQL statements separated by semicolons, which is necessary when your transformation is complex. The ask is to add a new SQLFileBasedTransformer that takes as input a Spark SQL file containing multiple Spark SQL statements and applies the transformations to the delta streamer payload. -- This message was sent by Atlassian Jira (v8.3.4#803005)
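The core of the proposed transformer is mechanical: split the file on semicolons and run each statement in order. A hedged Python sketch of that idea (function names and the SparkSession usage are illustrative assumptions, not the actual Hudi implementation):

```python
# Sketch of the SQLFileBasedTransformer idea: read a file of semicolon-
# separated Spark SQL statements and execute them sequentially, keeping
# the result of the last statement as the transformed payload.

def split_sql_statements(sql_text: str) -> list:
    """Split a file of Spark SQL on semicolons, dropping blank entries.
    (A production version would also need to respect semicolons inside
    string literals and comments.)"""
    return [s.strip() for s in sql_text.split(";") if s.strip()]

def apply_sql_file(spark, sql_path: str):
    """Run every statement in the file; the DataFrame from the last
    statement is what would be handed back to the delta streamer."""
    with open(sql_path) as f:
        statements = split_sql_statements(f.read())
    result = None
    for stmt in statements:
        result = spark.sql(stmt)
    return result

print(split_sql_statements("CACHE TABLE a; SELECT * FROM a;"))
# ['CACHE TABLE a', 'SELECT * FROM a']
```

Earlier statements can create temp views or cache tables that later statements build on, which is the multi-statement capability the single-query transformer lacks.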
[jira] [Closed] (HUDI-783) Add official python support to create hudi datasets using pyspark
[ https://issues.apache.org/jira/browse/HUDI-783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan closed HUDI-783. > Add official python support to create hudi datasets using pyspark > - > > Key: HUDI-783 > URL: https://issues.apache.org/jira/browse/HUDI-783 > Project: Apache Hudi > Issue Type: Wish > Components: Utilities >Reporter: Vinoth Govindarajan >Assignee: Vinoth Govindarajan >Priority: Major > Labels: features, pull-request-available > Fix For: 0.6.0 > > > *Goal:* > As a pyspark user, I would like to read/write hudi datasets using pyspark. > There are several components to achieve this goal. > # Create a hudi-pyspark package that users can import and start > reading/writing hudi datasets. > # Explain how to read/write hudi datasets using pyspark in a blog > post/documentation. > # Add the hudi-pyspark module to the hudi demo docker along with the > instructions. > # Make the package available as part of the [spark packages > index|https://spark-packages.org/] and [python package > index|https://pypi.org/] > The hudi-pyspark package should implement the HUDI data source API for Apache Spark, > using which HUDI files can be read as DataFrames and written to any Hadoop-supported > file system. > Usage pattern after we launch this feature should be something like this: > Install the package using: > {code:java} > pip install hudi-pyspark{code} > or > Include the hudi-pyspark package in your Spark Applications using: > spark-shell, pyspark, or spark-submit > {code:java} > > $SPARK_HOME/bin/spark-shell --packages > > org.apache.hudi:hudi-pyspark_2.11:0.5.2{code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HUDI-1695) Deltastreamer HoodieIncrSource exception error messaging is incorrect
[ https://issues.apache.org/jira/browse/HUDI-1695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan resolved HUDI-1695. --- Resolution: Fixed > Deltastreamer HoodieIncrSource exception error messaging is incorrect > - > > Key: HUDI-1695 > URL: https://issues.apache.org/jira/browse/HUDI-1695 > Project: Apache Hudi > Issue Type: Bug > Components: DeltaStreamer >Reporter: Vinoth Govindarajan >Assignee: Vinoth Govindarajan >Priority: Trivial > Labels: beginner, pull-request-available > Fix For: 0.8.0 > > > When you set your source_class as HoodieIncrSource and invoke deltastreamer > without any checkpoint, it throws the following Exception: > > {code:java} > User class threw exception: java.lang.IllegalArgumentException: Missing begin > instant for incremental pull. For reading from latest committed instant set > hoodie.deltastreamer.source.hoodie.read_latest_on_midding_ckpt to true{code} > > The error messaging is wrong and misleading, the correct parameter is: > {code:java} > hoodie.deltastreamer.source.hoodieincr.read_latest_on_missing_ckpt > {code} > Check out the correct parameter in this > [file|https://github.com/apache/hudi/blob/release-0.7.0/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/HoodieIncrSource.java#L78] > > The correct messaging should be: > {code:java} > User class threw exception: java.lang.IllegalArgumentException: Missing begin > instant for incremental pull. For reading from latest committed instant set > hoodie.deltastreamer.source.hoodieincr.read_latest_on_missing_ckpt to true > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (HUDI-1695) Deltastreamer HoodieIncrSource exception error messaging is incorrect
[ https://issues.apache.org/jira/browse/HUDI-1695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan closed HUDI-1695. - > Deltastreamer HoodieIncrSource exception error messaging is incorrect > - > > Key: HUDI-1695 > URL: https://issues.apache.org/jira/browse/HUDI-1695 > Project: Apache Hudi > Issue Type: Bug > Components: DeltaStreamer >Reporter: Vinoth Govindarajan >Assignee: Vinoth Govindarajan >Priority: Trivial > Labels: beginner, pull-request-available > Fix For: 0.8.0 > > > When you set your source_class as HoodieIncrSource and invoke deltastreamer > without any checkpoint, it throws the following Exception: > > {code:java} > User class threw exception: java.lang.IllegalArgumentException: Missing begin > instant for incremental pull. For reading from latest committed instant set > hoodie.deltastreamer.source.hoodie.read_latest_on_midding_ckpt to true{code} > > The error messaging is wrong and misleading, the correct parameter is: > {code:java} > hoodie.deltastreamer.source.hoodieincr.read_latest_on_missing_ckpt > {code} > Check out the correct parameter in this > [file|https://github.com/apache/hudi/blob/release-0.7.0/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/HoodieIncrSource.java#L78] > > The correct messaging should be: > {code:java} > User class threw exception: java.lang.IllegalArgumentException: Missing begin > instant for incremental pull. For reading from latest committed instant set > hoodie.deltastreamer.source.hoodieincr.read_latest_on_missing_ckpt to true > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1695) Deltastreamer HoodieIncrSource exception error messaging is incorrect
[ https://issues.apache.org/jira/browse/HUDI-1695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302235#comment-17302235 ] Vinoth Govindarajan commented on HUDI-1695: --- PR has been merged. > Deltastreamer HoodieIncrSource exception error messaging is incorrect > - > > Key: HUDI-1695 > URL: https://issues.apache.org/jira/browse/HUDI-1695 > Project: Apache Hudi > Issue Type: Bug > Components: DeltaStreamer >Reporter: Vinoth Govindarajan >Assignee: Vinoth Govindarajan >Priority: Trivial > Labels: beginner, pull-request-available > Fix For: 0.8.0 > > > When you set your source_class as HoodieIncrSource and invoke deltastreamer > without any checkpoint, it throws the following Exception: > > {code:java} > User class threw exception: java.lang.IllegalArgumentException: Missing begin > instant for incremental pull. For reading from latest committed instant set > hoodie.deltastreamer.source.hoodie.read_latest_on_midding_ckpt to true{code} > > The error messaging is wrong and misleading, the correct parameter is: > {code:java} > hoodie.deltastreamer.source.hoodieincr.read_latest_on_missing_ckpt > {code} > Check out the correct parameter in this > [file|https://github.com/apache/hudi/blob/release-0.7.0/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/HoodieIncrSource.java#L78] > > The correct messaging should be: > {code:java} > User class threw exception: java.lang.IllegalArgumentException: Missing begin > instant for incremental pull. For reading from latest committed instant set > hoodie.deltastreamer.source.hoodieincr.read_latest_on_missing_ckpt to true > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1695) Deltastreamer HoodieIncrSource exception error messaging is incorrect
[ https://issues.apache.org/jira/browse/HUDI-1695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan updated HUDI-1695: -- Summary: Deltastreamer HoodieIncrSource exception error messaging is incorrect (was: Deltastream HoodieIncrSource exception error messaging is incorrect) > Deltastreamer HoodieIncrSource exception error messaging is incorrect > - > > Key: HUDI-1695 > URL: https://issues.apache.org/jira/browse/HUDI-1695 > Project: Apache Hudi > Issue Type: Bug > Components: DeltaStreamer >Reporter: Vinoth Govindarajan >Assignee: Vinoth Govindarajan >Priority: Trivial > Labels: beginner > Fix For: 0.8.0 > > > When you set your source_class as HoodieIncrSource and invoke deltastreamer > without any checkpoint, it throws the following Exception: > > {code:java} > User class threw exception: java.lang.IllegalArgumentException: Missing begin > instant for incremental pull. For reading from latest committed instant set > hoodie.deltastreamer.source.hoodie.read_latest_on_midding_ckpt to true{code} > > The error messaging is wrong and misleading, the correct parameter is: > {code:java} > hoodie.deltastreamer.source.hoodieincr.read_latest_on_missing_ckpt > {code} > Check out the correct parameter in this > [file|https://github.com/apache/hudi/blob/release-0.7.0/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/HoodieIncrSource.java#L78] > > The correct messaging should be: > {code:java} > User class threw exception: java.lang.IllegalArgumentException: Missing begin > instant for incremental pull. For reading from latest committed instant set > hoodie.deltastreamer.source.hoodieincr.read_latest_on_missing_ckpt to true > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1695) Deltastream HoodieIncrSource exception error messaging is incorrect
Vinoth Govindarajan created HUDI-1695: - Summary: Deltastream HoodieIncrSource exception error messaging is incorrect Key: HUDI-1695 URL: https://issues.apache.org/jira/browse/HUDI-1695 Project: Apache Hudi Issue Type: Bug Components: DeltaStreamer Reporter: Vinoth Govindarajan Assignee: Vinoth Govindarajan Fix For: 0.8.0 When you set your source_class as HoodieIncrSource and invoke deltastreamer without any checkpoint, it throws the following Exception: {code:java} User class threw exception: java.lang.IllegalArgumentException: Missing begin instant for incremental pull. For reading from latest committed instant set hoodie.deltastreamer.source.hoodie.read_latest_on_midding_ckpt to true{code} The error messaging is wrong and misleading; the correct parameter is: {code:java} hoodie.deltastreamer.source.hoodieincr.read_latest_on_missing_ckpt {code} Check out the correct parameter in this [file|https://github.com/apache/hudi/blob/release-0.7.0/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/HoodieIncrSource.java#L78] The correct messaging should be: {code:java} User class threw exception: java.lang.IllegalArgumentException: Missing begin instant for incremental pull. For reading from latest committed instant set hoodie.deltastreamer.source.hoodieincr.read_latest_on_missing_ckpt to true {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-825) Write a small blog on how to use hudi-spark with pyspark
[ https://issues.apache.org/jira/browse/HUDI-825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan updated HUDI-825: - Status: Open (was: New) > Write a small blog on how to use hudi-spark with pyspark > > > Key: HUDI-825 > URL: https://issues.apache.org/jira/browse/HUDI-825 > Project: Apache Hudi > Issue Type: Task > Components: Docs >Reporter: Nishith Agarwal >Assignee: Vinoth Govindarajan >Priority: Major > Labels: user-support-issues > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-824) Register hudi-spark package with spark packages repo for easier usage of Hudi
[ https://issues.apache.org/jira/browse/HUDI-824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17272397#comment-17272397 ] Vinoth Govindarajan commented on HUDI-824: -- [~nagarwal] - All Apache projects are available to use directly with the `--packages` option; I tried it with pyspark and it worked: {code:java} spark-shell \ --packages org.apache.hudi:hudi-spark-bundle_2.12:0.7.0,org.apache.spark:spark-avro_2.12:3.0.1 \ --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' {code} The same instructions have been updated in the following doc: [https://hudi.apache.org/docs/quick-start-guide.html] No further action is needed; let me know if it's okay to close this issue. > Register hudi-spark package with spark packages repo for easier usage of Hudi > - > > Key: HUDI-824 > URL: https://issues.apache.org/jira/browse/HUDI-824 > Project: Apache Hudi > Issue Type: Bug > Components: Spark Integration >Reporter: Nishith Agarwal >Assignee: Vinoth Govindarajan >Priority: Minor > Labels: user-support-issues > > At the moment, to be able to use Hudi with spark, users have to do the > following: > > {{spark-2.4.4-bin-hadoop2.7/bin/spark-shell \ > --jars `ls > packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-*.*.*-SNAPSHOT.jar` > \ > --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'}} > > {{Ideally, we want to be able to use Hudi as follows:}} > > {{spark-2.4.4-bin-hadoop2.7/bin/spark-shell \ --packages > org.apache.hudi:hudi-spark-bundle: \ > --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-824) Register hudi-spark package with spark packages repo for easier usage of Hudi
[ https://issues.apache.org/jira/browse/HUDI-824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Govindarajan updated HUDI-824: - Status: Open (was: New) > Register hudi-spark package with spark packages repo for easier usage of Hudi > - > > Key: HUDI-824 > URL: https://issues.apache.org/jira/browse/HUDI-824 > Project: Apache Hudi > Issue Type: Bug > Components: Spark Integration >Reporter: Nishith Agarwal >Assignee: Vinoth Govindarajan >Priority: Minor > Labels: user-support-issues > > At the moment, to be able to use Hudi with spark, users have to do the > following: > > {{spark-2.4.4-bin-hadoop2.7/bin/spark-shell \ > --jars `ls > packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-*.*.*-SNAPSHOT.jar` > \ > --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'}} > > {{Ideally, we want to be able to use Hudi as follows:}} > > {{spark-2.4.4-bin-hadoop2.7/bin/spark-shell \ --packages > org.apache.hudi:hudi-spark-bundle: \ > --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-783) Add official python support to create hudi datasets using pyspark
[ https://issues.apache.org/jira/browse/HUDI-783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17130943#comment-17130943 ] Vinoth Govindarajan commented on HUDI-783: -- Status update: * -Explain how to read/write hudi datasets using pyspark in a blog post/documentation.- Done ** Here is the documentation: [https://hudi.apache.org/docs/quick-start-guide.html#pyspark-example] ** There is a separate ticket for the blog post - HUDI-825 * -Add the hudi-pyspark module to the hudi demo docker along with the instructions.- Done ** pyspark is now supported in the hudi demo docker; here is the [PR|https://github.com/apache/hudi/pull/1632] * -Make the package available as part of the [spark packages index|https://spark-packages.org/] and [python package index|https://pypi.org/]- Done ** Since Hudi is already an Apache project, we can use it directly as a package: ** {code:java} export PYSPARK_PYTHON=$(which python3) spark-2.4.4-bin-hadoop2.7/bin/pyspark \ --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \ --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' {code} > Add official python support to create hudi datasets using pyspark > - > > Key: HUDI-783 > URL: https://issues.apache.org/jira/browse/HUDI-783 > Project: Apache Hudi > Issue Type: Wish > Components: Utilities >Reporter: Vinoth Govindarajan >Assignee: Vinoth Govindarajan >Priority: Major > Labels: features, pull-request-available > Fix For: 0.6.0 > > > *Goal:* > As a pyspark user, I would like to read/write hudi datasets using pyspark. > There are several components to achieve this goal. > # Create a hudi-pyspark package that users can import and start > reading/writing hudi datasets. > # Explain how to read/write hudi datasets using pyspark in a blog > post/documentation. > # Add the hudi-pyspark module to the hudi demo docker along with the > instructions. 
> # Make the package available as part of the [spark packages > index|https://spark-packages.org/] and [python package > index|https://pypi.org/] > The hudi-pyspark package should implement the HUDI data source API for Apache Spark, > using which HUDI files can be read as DataFrames and written to any Hadoop-supported > file system. > Usage pattern after we launch this feature should be something like this: > Install the package using: > {code:java} > pip install hudi-pyspark{code} > or > Include the hudi-pyspark package in your Spark Applications using: > spark-shell, pyspark, or spark-submit > {code:java} > > $SPARK_HOME/bin/spark-shell --packages > > org.apache.hudi:hudi-pyspark_2.11:0.5.2{code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-783) Add official python support to create hudi datasets using pyspark
[ https://issues.apache.org/jira/browse/HUDI-783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17085253#comment-17085253 ] Vinoth Govindarajan commented on HUDI-783: -- Thanks, [~vinoth]! We don't need to write any wrapper code in Python to use it in pyspark; the existing jar files can be packaged and used in pyspark via the data source API. To improve the client experience, we need to register the hudi-spark bundle as a package on the [https://spark-packages.org/register] website. [spark-packages.org|https://spark-packages.org/] is an external, community-managed list of third-party libraries, add-ons, and applications that work with Apache Spark. To register a package, its content must be hosted by [GitHub|https://github.com/] in a public repo under the owner's account; I guess you can register the package since you are the owner of the repo. Thoughts? > Add official python support to create hudi datasets using pyspark > - > > Key: HUDI-783 > URL: https://issues.apache.org/jira/browse/HUDI-783 > Project: Apache Hudi (incubating) > Issue Type: Wish > Components: Utilities >Reporter: Vinoth Govindarajan >Assignee: Vinoth Govindarajan >Priority: Major > Labels: features > Fix For: 0.6.0 > > > *Goal:* > As a pyspark user, I would like to read/write hudi datasets using pyspark. > There are several components to achieve this goal. > # Create a hudi-pyspark package that users can import and start > reading/writing hudi datasets. > # Explain how to read/write hudi datasets using pyspark in a blog > post/documentation. > # Add the hudi-pyspark module to the hudi demo docker along with the > instructions. > # Make the package available as part of the [spark packages > index|https://spark-packages.org/] and [python package > index|https://pypi.org/] > The hudi-pyspark package should implement the HUDI data source API for Apache Spark, > using which HUDI files can be read as DataFrames and written to any Hadoop-supported > file system. 
> Usage pattern after we launch this feature should be something like this: > Install the package using: > {code:java} > pip install hudi-pyspark{code} > or > Include hudi-pyspark package in your Spark Applications using: > spark-shell, pyspark, or spark-submit > {code:java} > > $SPARK_HOME/bin/spark-shell --packages > > org.apache.hudi:hudi-pyspark_2.11:0.5.2{code} > > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)