[jira] [Created] (HUDI-6586) Add Incremental scan support to dbt

2023-07-24 Thread Vinoth Govindarajan (Jira)
Vinoth Govindarajan created HUDI-6586:
-

 Summary: Add Incremental scan support to dbt
 Key: HUDI-6586
 URL: https://issues.apache.org/jira/browse/HUDI-6586
 Project: Apache Hudi
  Issue Type: Epic
  Components: connectors
Reporter: Vinoth Govindarajan
Assignee: Vinoth Govindarajan
 Fix For: 1.0.0


The current dbt support adds only the basic Hudi primitives, but with deeper 
integration we could enable faster ETL queries using the incremental read 
primitive, similar to the DeltaStreamer support.

 

The goal of this epic is to enable incremental data processing for dbt.
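
For reference, the incremental read primitive this epic would build on looks roughly like the Spark snippet below (a minimal sketch; the table path and begin instant are placeholders, the option keys are the standard Hudi datasource options):
{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hudi-incremental-read-sketch")
  .getOrCreate()

// Read only the records committed after the given instant time.
val incremental = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", "20230701000000")
  .load("/tmp/hudi_trips")

incremental.createOrReplaceTempView("hudi_trips_incremental")
spark.sql("select count(*) from hudi_trips_incremental").show()
{code}
A dbt incremental model could wrap such a read and merge the result into the target table on each run.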

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-4254) Refactor SnowflakeSyncTool and BigQuerySyncTool

2022-06-14 Thread Vinoth Govindarajan (Jira)
Vinoth Govindarajan created HUDI-4254:
-

 Summary: Refactor SnowflakeSyncTool and BigQuerySyncTool
 Key: HUDI-4254
 URL: https://issues.apache.org/jira/browse/HUDI-4254
 Project: Apache Hudi
  Issue Type: Task
Reporter: Vinoth Govindarajan
Assignee: Vinoth Govindarajan


There are many similarities between SnowflakeSyncTool and BigQuerySyncTool; 
refactor the common methods into an abstract base class and have both tools 
extend it to avoid code duplication.
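
A rough sketch of the shape such a refactor could take (class and method names here are illustrative only, not the actual Hudi sync tool APIs):
{code:java}
// Warehouse-agnostic steps live in the abstract base class; each concrete
// tool only implements the warehouse-specific DDL.
abstract class AbstractSyncToolSketch(basePath: String) {
  // Shared step: build the list of latest snapshot files for the table.
  protected def generateManifest(): Seq[String] =
    Seq.empty // stub: would list the latest base files under basePath

  // Warehouse-specific step implemented by each concrete tool.
  protected def createOrReplaceExternalTable(tableName: String, files: Seq[String]): Unit

  final def sync(tableName: String): Unit =
    createOrReplaceExternalTable(tableName, generateManifest())
}

class SnowflakeSyncToolSketch(basePath: String) extends AbstractSyncToolSketch(basePath) {
  protected def createOrReplaceExternalTable(tableName: String, files: Seq[String]): Unit =
    () // would issue Snowflake external table DDL here
}

class BigQuerySyncToolSketch(basePath: String) extends AbstractSyncToolSketch(basePath) {
  protected def createOrReplaceExternalTable(tableName: String, files: Seq[String]): Unit =
    () // would issue BigQuery external table DDL here
}
{code}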

 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (HUDI-4220) Create a utility to convert json records to avro

2022-06-09 Thread Vinoth Govindarajan (Jira)
Vinoth Govindarajan created HUDI-4220:
-

 Summary: Create a utility to convert json records to avro
 Key: HUDI-4220
 URL: https://issues.apache.org/jira/browse/HUDI-4220
 Project: Apache Hudi
  Issue Type: Task
  Components: spark
Reporter: Vinoth Govindarajan
Assignee: Vinoth Govindarajan


We need a utility to convert topics consisting of many gzipped JSON-lines files into Avro.
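
A minimal Spark sketch of what such a utility could do (the bucket and topic paths are placeholders; it assumes the spark-avro package is on the classpath):
{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("json-to-avro-sketch")
  .getOrCreate()

// Spark reads gzipped JSON-lines files transparently.
val records = spark.read.json("gs://some-bucket/topics/some_topic/*.json.gz")

// Write the same records back out as Avro (requires spark-avro,
// e.g. --packages org.apache.spark:spark-avro_2.12:<spark version>).
records.write
  .format("avro")
  .mode("overwrite")
  .save("gs://some-bucket/avro/some_topic/")
{code}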



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-2832) [Umbrella] [RFC-40] Integrated Hudi with Snowflake

2022-05-22 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan updated HUDI-2832:
--
Summary: [Umbrella] [RFC-40] Integrated Hudi with Snowflake   (was: 
[Umbrella] [RFC-40] Implement SnowflakeSyncTool to support Hudi to Snowflake 
Integration)

> [Umbrella] [RFC-40] Integrated Hudi with Snowflake 
> ---
>
> Key: HUDI-2832
> URL: https://issues.apache.org/jira/browse/HUDI-2832
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: Common Core
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Blocker
>  Labels: BigQuery, Integration, pull-request-available
> Fix For: 0.12.0
>
>
> Snowflake is a fully managed service that’s simple to use but can power a 
> near-unlimited number of concurrent workloads. Snowflake is a solution for 
> data warehousing, data lakes, data engineering, data science, data 
> application development, and securely sharing and consuming shared data. 
> Snowflake [doesn’t 
> support|https://docs.snowflake.com/en/sql-reference/sql/alter-file-format.html]
>  the Apache Hudi file format yet, but it does support the Parquet, ORC, and Delta 
> file formats. This proposal is to implement a SnowflakeSync, similar to 
> HiveSync, to sync the Hudi table as a Snowflake external Parquet table so 
> that users can query Hudi tables using Snowflake. Many users have 
> expressed interest, in Hudi and other support channels, asking to integrate 
> Hudi with Snowflake; this will unlock new use cases for Hudi.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (HUDI-4137) Implement SnowflakeSyncTool to support Hudi to Snowflake Integration

2022-05-22 Thread Vinoth Govindarajan (Jira)
Vinoth Govindarajan created HUDI-4137:
-

 Summary: Implement SnowflakeSyncTool to support Hudi to Snowflake 
Integration
 Key: HUDI-4137
 URL: https://issues.apache.org/jira/browse/HUDI-4137
 Project: Apache Hudi
  Issue Type: Task
Reporter: Vinoth Govindarajan
Assignee: Vinoth Govindarajan
 Fix For: 0.12.0


Implement SnowflakeSyncTool, similar to the BigQuerySyncTool, to support Hudi to 
Snowflake integration.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4000) Docs around DBT

2022-05-04 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan updated HUDI-4000:
--
Status: Patch Available  (was: In Progress)

> Docs around DBT
> ---
>
> Key: HUDI-4000
> URL: https://issues.apache.org/jira/browse/HUDI-4000
> Project: Apache Hudi
>  Issue Type: Task
>  Components: docs
>Reporter: Ethan Guo
>Assignee: Vinoth Govindarajan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>
> [https://github.com/apache/hudi/issues/5367]
> Do we have any step-by-step document on how to use Hudi in conjunction with 
> DBT? 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4000) Docs around DBT

2022-05-04 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan updated HUDI-4000:
--
Status: In Progress  (was: Open)

> Docs around DBT
> ---
>
> Key: HUDI-4000
> URL: https://issues.apache.org/jira/browse/HUDI-4000
> Project: Apache Hudi
>  Issue Type: Task
>  Components: docs
>Reporter: Ethan Guo
>Assignee: Vinoth Govindarajan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>
> [https://github.com/apache/hudi/issues/5367]
> Do we have any step-by-step document on how to use Hudi in conjunction with 
> DBT? 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Closed] (HUDI-1743) Add support for Spark SQL File based transformer for deltastreamer

2022-04-24 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan closed HUDI-1743.
-
Resolution: Fixed

> Add support for Spark SQL File based transformer for deltastreamer
> --
>
> Key: HUDI-1743
> URL: https://issues.apache.org/jira/browse/HUDI-1743
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: deltastreamer
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Minor
>  Labels: features, pull-request-available, sev:normal
>
> The current SQL-query-based transformer is limited in functionality: you can't 
> pass multiple Spark SQL statements separated by semicolons, which is 
> necessary if your transformation is complex.
>  
> The ask is to add a new SQLFileBasedTransformer which takes as input a Spark SQL 
> file containing multiple Spark SQL statements and applies the transformations to 
> the DeltaStreamer payload.
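
For illustration, a minimal sketch of what such a file-based transformer could do (this is not the actual Hudi Transformer API; the temp view name is illustrative): register the incoming batch as a temp view, run each statement from the SQL file in order, and return the result of the last statement as the transformed payload.
{code:java}
import org.apache.spark.sql.{DataFrame, SparkSession}

def applySqlFile(spark: SparkSession, source: DataFrame, sqlFilePath: String): DataFrame = {
  // Expose the incoming batch to the SQL statements under a well-known view name.
  source.createOrReplaceTempView("HOODIE_SRC_TMP_TABLE")
  // Read the whole SQL file and split it into individual statements.
  val sqlText = spark.sparkContext.wholeTextFiles(sqlFilePath).values.first()
  val statements = sqlText.split(";").map(_.trim).filter(_.nonEmpty)
  // Execute every statement; the result of the last one becomes the transformed payload.
  statements.map(stmt => spark.sql(stmt)).last
}
{code}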



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Reopened] (HUDI-1743) Add support for Spark SQL File based transformer for deltastreamer

2022-04-24 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan reopened HUDI-1743:
---

> Add support for Spark SQL File based transformer for deltastreamer
> --
>
> Key: HUDI-1743
> URL: https://issues.apache.org/jira/browse/HUDI-1743
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: deltastreamer
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Minor
>  Labels: features, pull-request-available, sev:normal
>
> The current SQL-query-based transformer is limited in functionality: you can't 
> pass multiple Spark SQL statements separated by semicolons, which is 
> necessary if your transformation is complex.
>  
> The ask is to add a new SQLFileBasedTransformer which takes as input a Spark SQL 
> file containing multiple Spark SQL statements and applies the transformations to 
> the DeltaStreamer payload.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (HUDI-1743) Add support for Spark SQL File based transformer for deltastreamer

2022-04-24 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan resolved HUDI-1743.
---

> Add support for Spark SQL File based transformer for deltastreamer
> --
>
> Key: HUDI-1743
> URL: https://issues.apache.org/jira/browse/HUDI-1743
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: deltastreamer
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Minor
>  Labels: features, pull-request-available, sev:normal
>
> The current SQL-query-based transformer is limited in functionality: you can't 
> pass multiple Spark SQL statements separated by semicolons, which is 
> necessary if your transformation is complex.
>  
> The ask is to add a new SQLFileBasedTransformer which takes as input a Spark SQL 
> file containing multiple Spark SQL statements and applies the transformations to 
> the DeltaStreamer payload.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Closed] (HUDI-1790) Add SqlSource for DeltaStreamer to support backfill use cases

2022-04-24 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan closed HUDI-1790.
-
Resolution: Fixed

> Add SqlSource for DeltaStreamer to support backfill use cases
> -
>
> Key: HUDI-1790
> URL: https://issues.apache.org/jira/browse/HUDI-1790
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: deltastreamer
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Major
>  Labels: pull-request-available, sev:normal
>
> DeltaStreamer is great for incremental workloads, but we need to support 
> backfills for use cases such as adding a new column and backfilling only that 
> column for the last 6 months, or reprocessing a couple of older partitions 
> after a bug in our transformation logic.
>  
> If we have a SqlSource as one of the input sources to DeltaStreamer, then 
> I can pass any custom Spark SQL query selecting specific partitions and 
> backfill them.
>  
> When we do the backfill, we don't need to advance the last processed commit 
> checkpoint; instead, the last processed checkpoint from before the backfill 
> should be copied over to the backfill commit.
>  
> cc [~nishith29]
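
A conceptual sketch of the backfill-source idea (not the actual Hudi Source API; the function shape is illustrative): evaluate the user-supplied Spark SQL query and hand the result to the ingestion job while carrying the previous checkpoint forward unchanged.
{code:java}
import org.apache.spark.sql.{DataFrame, SparkSession}

def fetchBackfillBatch(spark: SparkSession,
                       backfillQuery: String,
                       lastCheckpoint: Option[String]): (DataFrame, Option[String]) = {
  // e.g. "select * from src_table where dt between '2021-01-01' and '2021-01-31'"
  val batch = spark.sql(backfillQuery)
  // Do not advance the checkpoint for a backfill; carry the previous one forward.
  (batch, lastCheckpoint)
}
{code}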



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (HUDI-1762) Hive Sync is not working with Hive Style Partitioning

2022-04-24 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan resolved HUDI-1762.
---

> Hive Sync is not working with Hive Style Partitioning
> -
>
> Key: HUDI-1762
> URL: https://issues.apache.org/jira/browse/HUDI-1762
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: hive
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Major
>  Labels: hive, pull-request-available
>
> When you create a Hudi table with Hive-style partitioning and enable Hive 
> sync, it doesn't work because the default extractor assumes the partition 
> values are separated by slashes.
>  
> When Hive-style partitioning is enabled for the target table like this:
> {code:java}
> hoodie.datasource.write.partitionpath.field=datestr
> hoodie.datasource.write.hive_style_partitioning=true
> {code}
> This is the error it throws:
> {code:java}
> 21/04/01 23:10:33 ERROR deltastreamer.HoodieDeltaStreamer: Got error running 
> delta sync once. Shutting down
> org.apache.hudi.exception.HoodieException: Got runtime exception when hive 
> syncing delta_streamer_test
>   at 
> org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:122)
>   at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncMeta(DeltaSync.java:560)
>   at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:475)
>   at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:282)
>   at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:170)
>   at org.apache.hudi.common.util.Option.ifPresent(Option.java:96)
>   at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:168)
>   at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:470)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:690)
> Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Failed to sync 
> partitions for table fact_scheduled_trip__1pc_trip_uuid
>   at 
> org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:229)
>   at 
> org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:166)
>   at 
> org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:108)
>   ... 12 more
> Caused by: java.lang.IllegalArgumentException: Partition path 
> datestr=2021-03-28 is not in the form yyyy/mm/dd 
>   at 
> org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor.extractPartitionValuesInPath(SlashEncodedDayPartitionValueExtractor.java:55)
>   at 
> org.apache.hudi.hive.HoodieHiveClient.getPartitionEvents(HoodieHiveClient.java:220)
>   at 
> org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:221)
>   ... 14 more
> {code}
> To fix this issue we need to create a new partition extractor class and 
> assign that class name as the hive sync partition extractor.
> After you define the new partition extractor class, you can configure it like 
> this:
> {code:java}
> hoodie.datasource.hive_sync.enable=true
> hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor
> {code}
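
For illustration, a rough sketch of the parsing logic such a Hive-style partition value extractor needs (the real implementation would plug into Hudi's partition value extractor interface; this standalone function is only a sketch):
{code:java}
// Parse a hive-style partition path such as "datestr=2021-03-28"
// (or "country=US/datestr=2021-03-28" for multi-level partitioning)
// into its partition values.
def extractPartitionValuesInPath(partitionPath: String): List[String] =
  partitionPath.split("/").toList.map { segment =>
    segment.split("=") match {
      case Array(_, value) => value // keep only the value part, e.g. "2021-03-28"
      case _ =>
        throw new IllegalArgumentException(
          s"Partition path $partitionPath is not in the expected key=value form")
    }
  }
{code}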



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Closed] (HUDI-1762) Hive Sync is not working with Hive Style Partitioning

2022-04-24 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan closed HUDI-1762.
-
Resolution: Fixed

> Hive Sync is not working with Hive Style Partitioning
> -
>
> Key: HUDI-1762
> URL: https://issues.apache.org/jira/browse/HUDI-1762
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: hive
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Major
>  Labels: hive, pull-request-available
>
> When you create a Hudi table with Hive-style partitioning and enable Hive 
> sync, it doesn't work because the default extractor assumes the partition 
> values are separated by slashes.
>  
> When Hive-style partitioning is enabled for the target table like this:
> {code:java}
> hoodie.datasource.write.partitionpath.field=datestr
> hoodie.datasource.write.hive_style_partitioning=true
> {code}
> This is the error it throws:
> {code:java}
> 21/04/01 23:10:33 ERROR deltastreamer.HoodieDeltaStreamer: Got error running 
> delta sync once. Shutting down
> org.apache.hudi.exception.HoodieException: Got runtime exception when hive 
> syncing delta_streamer_test
>   at 
> org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:122)
>   at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncMeta(DeltaSync.java:560)
>   at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:475)
>   at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:282)
>   at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:170)
>   at org.apache.hudi.common.util.Option.ifPresent(Option.java:96)
>   at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:168)
>   at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:470)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:690)
> Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Failed to sync 
> partitions for table fact_scheduled_trip__1pc_trip_uuid
>   at 
> org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:229)
>   at 
> org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:166)
>   at 
> org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:108)
>   ... 12 more
> Caused by: java.lang.IllegalArgumentException: Partition path 
> datestr=2021-03-28 is not in the form yyyy/mm/dd 
>   at 
> org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor.extractPartitionValuesInPath(SlashEncodedDayPartitionValueExtractor.java:55)
>   at 
> org.apache.hudi.hive.HoodieHiveClient.getPartitionEvents(HoodieHiveClient.java:220)
>   at 
> org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:221)
>   ... 14 more
> {code}
> To fix this issue we need to create a new partition extractor class and 
> assign that class name as the hive sync partition extractor.
> After you define the new partition extractor class, you can configure it like 
> this:
> {code:java}
> hoodie.datasource.hive_sync.enable=true
> hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Reopened] (HUDI-1762) Hive Sync is not working with Hive Style Partitioning

2022-04-24 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan reopened HUDI-1762:
---

> Hive Sync is not working with Hive Style Partitioning
> -
>
> Key: HUDI-1762
> URL: https://issues.apache.org/jira/browse/HUDI-1762
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: hive
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Major
>  Labels: hive, pull-request-available
>
> When you create a Hudi table with Hive-style partitioning and enable Hive 
> sync, it doesn't work because the default extractor assumes the partition 
> values are separated by slashes.
>  
> When Hive-style partitioning is enabled for the target table like this:
> {code:java}
> hoodie.datasource.write.partitionpath.field=datestr
> hoodie.datasource.write.hive_style_partitioning=true
> {code}
> This is the error it throws:
> {code:java}
> 21/04/01 23:10:33 ERROR deltastreamer.HoodieDeltaStreamer: Got error running 
> delta sync once. Shutting down
> org.apache.hudi.exception.HoodieException: Got runtime exception when hive 
> syncing delta_streamer_test
>   at 
> org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:122)
>   at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncMeta(DeltaSync.java:560)
>   at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:475)
>   at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:282)
>   at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:170)
>   at org.apache.hudi.common.util.Option.ifPresent(Option.java:96)
>   at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:168)
>   at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:470)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:690)
> Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Failed to sync 
> partitions for table fact_scheduled_trip__1pc_trip_uuid
>   at 
> org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:229)
>   at 
> org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:166)
>   at 
> org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:108)
>   ... 12 more
> Caused by: java.lang.IllegalArgumentException: Partition path 
> datestr=2021-03-28 is not in the form yyyy/mm/dd 
>   at 
> org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor.extractPartitionValuesInPath(SlashEncodedDayPartitionValueExtractor.java:55)
>   at 
> org.apache.hudi.hive.HoodieHiveClient.getPartitionEvents(HoodieHiveClient.java:220)
>   at 
> org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:221)
>   ... 14 more
> {code}
> To fix this issue we need to create a new partition extractor class and 
> assign that class name as the hive sync partition extractor.
> After you define the new partition extractor class, you can configure it like 
> this:
> {code:java}
> hoodie.datasource.hive_sync.enable=true
> hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Reopened] (HUDI-1790) Add SqlSource for DeltaStreamer to support backfill use cases

2022-04-24 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan reopened HUDI-1790:
---

> Add SqlSource for DeltaStreamer to support backfill use cases
> -
>
> Key: HUDI-1790
> URL: https://issues.apache.org/jira/browse/HUDI-1790
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: deltastreamer
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Major
>  Labels: pull-request-available, sev:normal
>
> DeltaStreamer is great for incremental workloads, but we need to support 
> backfills for use cases such as adding a new column and backfilling only that 
> column for the last 6 months, or reprocessing a couple of older partitions 
> after a bug in our transformation logic.
>  
> If we have a SqlSource as one of the input sources to DeltaStreamer, then 
> I can pass any custom Spark SQL query selecting specific partitions and 
> backfill them.
>  
> When we do the backfill, we don't need to advance the last processed commit 
> checkpoint; instead, the last processed checkpoint from before the backfill 
> should be copied over to the backfill commit.
>  
> cc [~nishith29]



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (HUDI-1790) Add SqlSource for DeltaStreamer to support backfill use cases

2022-04-24 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan resolved HUDI-1790.
---

> Add SqlSource for DeltaStreamer to support backfill use cases
> -
>
> Key: HUDI-1790
> URL: https://issues.apache.org/jira/browse/HUDI-1790
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: deltastreamer
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Major
>  Labels: pull-request-available, sev:normal
>
> DeltaStreamer is great for incremental workloads, but we need to support 
> backfills for use cases such as adding a new column and backfilling only that 
> column for the last 6 months, or reprocessing a couple of older partitions 
> after a bug in our transformation logic.
>  
> If we have a SqlSource as one of the input sources to DeltaStreamer, then 
> I can pass any custom Spark SQL query selecting specific partitions and 
> backfill them.
>  
> When we do the backfill, we don't need to advance the last processed commit 
> checkpoint; instead, the last processed checkpoint from before the backfill 
> should be copied over to the backfill commit.
>  
> cc [~nishith29]



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Closed] (HUDI-2319) Integrate hudi with dbt (data build tool)

2022-04-12 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan closed HUDI-2319.
-
Resolution: Fixed

> Integrate hudi with dbt (data build tool)
> -
>
> Key: HUDI-2319
> URL: https://issues.apache.org/jira/browse/HUDI-2319
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Usability
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Critical
>  Labels: integration, pull-request-available
> Fix For: 0.10.0
>
>
> dbt (data build tool) enables analytics engineers to transform data in their 
> warehouses by simply writing select statements. dbt handles turning these 
> select statements into tables and views.
> dbt does the {{T}} in {{ELT}} (Extract, Load, Transform) processes – it 
> doesn’t extract or load data, but it’s extremely good at transforming data 
> that’s already loaded into your warehouse.
>  
> dbt currently supports only the Delta file format; there are a few folks in the dbt 
> community asking for Hudi integration. Moreover, adding dbt integration will 
> make it easier to create derived datasets for any data engineer/analyst.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (HUDI-3838) Make Drop partition column config work with deltastreamer

2022-04-12 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan resolved HUDI-3838.
---

> Make Drop partition column config work with deltastreamer
> -
>
> Key: HUDI-3838
> URL: https://issues.apache.org/jira/browse/HUDI-3838
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: meta-sync
>Reporter: Raymond Xu
>Assignee: Vinoth Govindarajan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> hoodie.datasource.write.drop.partition.columns only works for the datasource 
> writer; HoodieDeltaStreamer is not using it. We need it for the DeltaStreamer -> 
> BigQuery sync flow.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HUDI-3838) Make Drop partition column config work with deltastreamer

2022-04-12 Thread Vinoth Govindarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17521370#comment-17521370
 ] 

Vinoth Govindarajan commented on HUDI-3838:
---

Added the support; see the linked PR for more details.

> Make Drop partition column config work with deltastreamer
> -
>
> Key: HUDI-3838
> URL: https://issues.apache.org/jira/browse/HUDI-3838
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: meta-sync
>Reporter: Raymond Xu
>Assignee: Vinoth Govindarajan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> hoodie.datasource.write.drop.partition.columns only works for the datasource 
> writer; HoodieDeltaStreamer is not using it. We need it for the DeltaStreamer -> 
> BigQuery sync flow.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Closed] (HUDI-3838) Make Drop partition column config work with deltastreamer

2022-04-12 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan closed HUDI-3838.
-
Resolution: Fixed

> Make Drop partition column config work with deltastreamer
> -
>
> Key: HUDI-3838
> URL: https://issues.apache.org/jira/browse/HUDI-3838
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: meta-sync
>Reporter: Raymond Xu
>Assignee: Vinoth Govindarajan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> hoodie.datasource.write.drop.partition.columns only works for the datasource 
> writer; HoodieDeltaStreamer is not using it. We need it for the DeltaStreamer -> 
> BigQuery sync flow.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3838) Make Drop partition column config work with deltastreamer

2022-04-11 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan updated HUDI-3838:
--
Status: Patch Available  (was: In Progress)

> Make Drop partition column config work with deltastreamer
> -
>
> Key: HUDI-3838
> URL: https://issues.apache.org/jira/browse/HUDI-3838
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: meta-sync
>Reporter: Raymond Xu
>Assignee: Vinoth Govindarajan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> hoodie.datasource.write.drop.partition.columns only works for the datasource 
> writer; HoodieDeltaStreamer is not using it. We need it for the DeltaStreamer -> 
> BigQuery sync flow.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2438) [Umbrella] [RFC-34] Implement BigQuerySyncTool for BigQuery Sync

2022-04-04 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan updated HUDI-2438:
--
Status: In Progress  (was: Open)

> [Umbrella] [RFC-34] Implement BigQuerySyncTool for BigQuery Sync
> 
>
> Key: HUDI-2438
> URL: https://issues.apache.org/jira/browse/HUDI-2438
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: Common Core, meta-sync
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Blocker
>  Labels: BigQuery, Integration, pull-request-available
> Fix For: 0.11.0
>
>
> BigQuery is Google Cloud's fully managed, petabyte-scale, and cost-effective 
> analytics data warehouse that lets you run analytics over vast amounts of 
> data in near real-time. BigQuery currently [doesn’t 
> support|https://cloud.google.com/bigquery/external-data-cloud-storage] the Apache 
> Hudi file format, but it does support the Parquet file format. The 
> proposal is to implement a BigQuerySync, similar to HiveSync, to sync the Hudi 
> table as a BigQuery external Parquet table so that users can query Hudi 
> tables using BigQuery. Uber is already syncing some of its Hudi tables to a 
> BigQuery data mart; this will help them to write, sync, and query.
>  
> More details are in RFC-34: 
> [https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=188745980]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3371) Add Dremio integration support to hudi

2022-04-04 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan updated HUDI-3371:
--
Issue Type: New Feature  (was: Task)

> Add Dremio integration support to hudi
> --
>
> Key: HUDI-3371
> URL: https://issues.apache.org/jira/browse/HUDI-3371
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: sivabalan narayanan
>Assignee: Vinoth Govindarajan
>Priority: Major
>
> For BI reports and dashboards, Dremio provides accelerated queries, so adding 
> Dremio integration with Hudi will be an added value to Hudi users in general. 
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Comment Edited] (HUDI-2438) [Umbrella] [RFC-34] Implement BigQuerySyncTool for BigQuery Sync

2022-04-03 Thread Vinoth Govindarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17516608#comment-17516608
 ] 

Vinoth Govindarajan edited comment on HUDI-2438 at 4/4/22 3:07 AM:
---

Hi [~gauravrai0x] & [~l0s01w3], there is a way to generate manifest files: 
[~joyansil] implemented a Java client for this, and the details are in this ticket: 
https://issues.apache.org/jira/browse/HUDI-3020

 

The BigQuerySyncTool is also available and will be part of the 0.11.0 release; it 
already invokes the manifest generation code as part of the sync method.

 

Here is how you can test it out:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hudi.metadata.ManifestFileUtil

val conf = new Configuration();{code}
{code:java}
val basePath = "gs://udp-hudi-storage5/store_visit_scan_bootstrap_hudi"
val manifestFileUtil: ManifestFileUtil = 
ManifestFileUtil.builder().setConf(conf).setBasePath(basePath).build();
manifestFileUtil.writeManifestFile() {code}
{code:java}
gsutil cat 
gs://udp-hudi-storage5/store_visit_scan_bootstrap_hudi/.hoodie/manifest/latest-snapshot.csv 
 | head -5{code}
{code:java}
gs://udp-hudi-storage5/store_visit_scan_bootstrap_hudi/op_cmpny_cd=WMT-US/visit_date=2020-11-01/95e78133-08dd-4721-9c49-8fbe338589f0-0_621-10-1021_20210927173651.parquet
gs://udp-hudi-storage5/store_visit_scan_bootstrap_hudi/op_cmpny_cd=WMT-US/visit_date=2020-11-01/1c01ddf1-41e8-43bf-94f3-9a75c9e39b21-0_1238-12-1638_20210927173651.parquet
gs://udp-hudi-storage5/store_visit_scan_bootstrap_hudi/op_cmpny_cd=WMT-US/visit_date=2020-11-01/a021864b-d0b6-462a-8f47-6370668993d6-0_694-12-1094_20210927173651.parquet
gs://udp-hudi-storage5/store_visit_scan_bootstrap_hudi/op_cmpny_cd=WMT-US/visit_date=2020-11-01/11815af9-6ed5-4f05-9b6b-7f79479f4f53-0_625-12-1025_20210927173651.parquet
gs://udp-hudi-storage5/store_visit_scan_bootstrap_hudi/op_cmpny_cd=WMT-US/visit_date=2020-11-01/ef4fc799-9124-4656-8106-f0374590dde6-0_1151-12-1551_20210927173651.parquet
 {code}


was (Author: vino):
Hi [~gauravrai0x] & [~l0s01w3], there is a way to generate manifest files: 
[~joyansil] implemented a Java client for this, and the details are in this ticket: 
https://issues.apache.org/jira/browse/HUDI-3020

 

The BigQuerySyncTool is also available and will be part of the 0.11.0 release; it 
already invokes the manifest generation code as part of the sync method.

 

Here is how you can test it out:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hudi.metadata.ManifestFileUtil

val conf = new Configuration();{code}
{code:java}
val basePath = "gs://udp-hudi-storage5/store_visit_scan_bootstrap_hudi"
val manifestFileUtil: ManifestFileUtil = 
ManifestFileUtil.builder().setConf(conf).setBasePath(basePath).build();
manifestFileUtil.writeManifestFile() {code}
 

> [Umbrella] [RFC-34] Implement BigQuerySyncTool for BigQuery Sync
> 
>
> Key: HUDI-2438
> URL: https://issues.apache.org/jira/browse/HUDI-2438
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: Common Core, meta-sync
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Blocker
>  Labels: BigQuery, Integration, pull-request-available
> Fix For: 0.11.0
>
>
> BigQuery is Google Cloud's fully managed, petabyte-scale, and cost-effective 
> analytics data warehouse that lets you run analytics over vast amounts of 
> data in near real-time. BigQuery currently [doesn’t 
> support|https://cloud.google.com/bigquery/external-data-cloud-storage] the Apache 
> Hudi file format, but it does support the Parquet file format. The 
> proposal is to implement a BigQuerySync, similar to HiveSync, to sync the Hudi 
> table as a BigQuery external Parquet table so that users can query Hudi 
> tables using BigQuery. Uber is already syncing some of its Hudi tables to a 
> BigQuery data mart; this will help them to write, sync, and query.
>  
> More details are in RFC-34: 
> [https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=188745980]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HUDI-2438) [Umbrella] [RFC-34] Implement BigQuerySyncTool for BigQuery Sync

2022-04-03 Thread Vinoth Govindarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17516608#comment-17516608
 ] 

Vinoth Govindarajan commented on HUDI-2438:
---

Hi [~gauravrai0x] & [~l0s01w3], there is a way to generate manifest files: 
[~joyansil] implemented a Java client for this, and the details are in this ticket: 
https://issues.apache.org/jira/browse/HUDI-3020

 

The BigQuerySyncTool is also available and will be part of the 0.11.0 release; it 
already invokes the manifest generation code as part of the sync method.

 

Here is how you can test it out:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hudi.metadata.ManifestFileUtil

val conf = new Configuration();{code}
{code:java}
val basePath = "gs://udp-hudi-storage5/store_visit_scan_bootstrap_hudi"
val manifestFileUtil: ManifestFileUtil = 
ManifestFileUtil.builder().setConf(conf).setBasePath(basePath).build();
manifestFileUtil.writeManifestFile() {code}
 

> [Umbrella] [RFC-34] Implement BigQuerySyncTool for BigQuery Sync
> 
>
> Key: HUDI-2438
> URL: https://issues.apache.org/jira/browse/HUDI-2438
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: Common Core, meta-sync
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Blocker
>  Labels: BigQuery, Integration, pull-request-available
> Fix For: 0.11.0
>
>
> BigQuery is Google Cloud's fully managed, petabyte-scale, and cost-effective 
> analytics data warehouse that lets you run analytics over vast amounts of 
> data in near real-time. BigQuery currently [doesn’t 
> support|https://cloud.google.com/bigquery/external-data-cloud-storage] the Apache 
> Hudi file format, but it does support the Parquet file format. The 
> proposal is to implement a BigQuerySync, similar to HiveSync, to sync the Hudi 
> table as a BigQuery external Parquet table so that users can query Hudi 
> tables using BigQuery. Uber is already syncing some of its Hudi tables to a 
> BigQuery data mart; this will help them to write, sync, and query.
>  
> More details are in RFC-34: 
> [https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=188745980]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3748) Hudi fails to insert into a partitioned table when the partition column is dropped from the parquet schema

2022-03-29 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan updated HUDI-3748:
--
Description: 
When you add this config to drop the partition column from the parquet schema 
to support BigQuery, hudi fails to insert with the following error.

 

Steps to reproduce:

Start hudi docker:
{code:java}
cd hudi/docker
./setup_demo.sh{code}
{code:java}
docker exec -it adhoc-2 /bin/bash{code}
{code:java}
# Log into spark-sql and execute the following commands:
spark-sql  --jars $HUDI_SPARK_BUNDLE \
  --master local[2] \
  --driver-class-path $HADOOP_CONF_DIR \
  --conf spark.sql.hive.convertMetastoreParquet=false \
  --deploy-mode client \
  --driver-memory 1G \
  --executor-memory 3G \
  --num-executors 1 \
  --packages org.apache.spark:spark-avro_2.11:2.4.4 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 
'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' 
{code}
 
{code:java}
create table bq_demo_partitioned_cow (
   id bigint,
   name string,
   price double,
   ts bigint,
   dt string
) using hudi
partitioned by (dt)
tblproperties (
type = 'cow',
primaryKey = 'id',
preCombineField = 'ts',
hoodie.datasource.write.drop.partition.columns = 'true'
);


insert into bq_demo_partitioned_cow partition(dt='2021-12-25') values(1, 'a1', 
10, current_timestamp());
insert into bq_demo_partitioned_cow partition(dt='2021-12-25') values(2, 'a2', 
20, current_timestamp()); {code}
Error:
{code:java}
22/03/29 20:58:02 INFO spark.SparkContext: Starting job: collect at 
HoodieSparkEngineContext.java:100
22/03/29 20:58:02 INFO scheduler.DAGScheduler: Got job 63 (collect at 
HoodieSparkEngineContext.java:100) with 1 output partitions
22/03/29 20:58:02 INFO scheduler.DAGScheduler: Final stage: ResultStage 131 
(collect at HoodieSparkEngineContext.java:100)
22/03/29 20:58:02 INFO scheduler.DAGScheduler: Parents of final stage: List()
22/03/29 20:58:02 INFO scheduler.DAGScheduler: Missing parents: List()
22/03/29 20:58:02 INFO scheduler.DAGScheduler: Submitting ResultStage 131 
(MapPartitionsRDD[235] at map at HoodieSparkEngineContext.java:100), which has 
no missing parents
22/03/29 20:58:02 INFO memory.MemoryStore: Block broadcast_89 stored as values 
in memory (estimated size 71.9 KB, free 364.0 MB)
22/03/29 20:58:02 INFO memory.MemoryStore: Block broadcast_89_piece0 stored as 
bytes in memory (estimated size 26.3 KB, free 364.0 MB)
22/03/29 20:58:02 INFO storage.BlockManagerInfo: Added broadcast_89_piece0 in 
memory on adhoc-1:38703 (size: 26.3 KB, free: 365.7 MB)
22/03/29 20:58:02 INFO spark.SparkContext: Created broadcast 89 from broadcast 
at DAGScheduler.scala:1161
22/03/29 20:58:02 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from 
ResultStage 131 (MapPartitionsRDD[235] at map at 
HoodieSparkEngineContext.java:100) (first 15 tasks are for partitions Vector(0))
22/03/29 20:58:02 INFO scheduler.TaskSchedulerImpl: Adding task set 131.0 with 
1 tasks
22/03/29 20:58:02 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 
131.0 (TID 4081, localhost, executor driver, partition 0, PROCESS_LOCAL, 7803 
bytes)
22/03/29 20:58:02 INFO executor.Executor: Running task 0.0 in stage 131.0 (TID 
4081)
22/03/29 20:58:02 INFO executor.Executor: Finished task 0.0 in stage 131.0 (TID 
4081). 1167 bytes result sent to driver
22/03/29 20:58:02 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 
131.0 (TID 4081) in 17 ms on localhost (executor driver) (1/1)
22/03/29 20:58:02 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 131.0, 
whose tasks have all completed, from pool
22/03/29 20:58:02 INFO scheduler.DAGScheduler: ResultStage 131 (collect at 
HoodieSparkEngineContext.java:100) finished in 0.030 s
22/03/29 20:58:02 INFO scheduler.DAGScheduler: Job 63 finished: collect at 
HoodieSparkEngineContext.java:100, took 0.032364 s
22/03/29 20:58:02 INFO timeline.HoodieActiveTimeline: Loaded instants upto : 
Option{val=[20220329205734338__commit__COMPLETED]}
22/03/29 20:58:02 INFO view.AbstractTableFileSystemView: Took 0 ms to read  0 
instants, 0 replaced file groups
22/03/29 20:58:02 INFO util.ClusteringUtils: Found 0 files in pending 
clustering operations
22/03/29 20:58:02 INFO view.AbstractTableFileSystemView: addFilesToView: 
NumFiles=1, NumFileGroups=1, FileGroupsCreationTime=0, StoreTimeTaken=0
22/03/29 20:58:02 INFO hudi.HoodieFileIndex: Refresh table 
bq_demo_partitioned_cow, spend: 124 ms
22/03/29 20:58:02 INFO table.HoodieTableMetaClient: Loading 
HoodieTableMetaClient from 
hdfs://namenode:8020/user/hive/warehouse/bq_demo_partitioned_cow
22/03/29 

[jira] [Created] (HUDI-3748) Hudi fails to insert into a partitioned table when the partition column is dropped from the parquet schema

2022-03-29 Thread Vinoth Govindarajan (Jira)
Vinoth Govindarajan created HUDI-3748:
-

 Summary: Hudi fails to insert into a partitioned table when the 
partition column is dropped from the parquet schema
 Key: HUDI-3748
 URL: https://issues.apache.org/jira/browse/HUDI-3748
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Vinoth Govindarajan


When you add this config to drop the partition column from the parquet schema 
to support BigQuery, hudi fails to insert with the following error.

 

Steps to reproduce:

Log into spark-sql and execute the following commands:


{code:java}
create table bq_demo_partitioned_cow (
   id bigint,
   name string,
   price double,
   ts bigint,
   dt string
) using hudi
partitioned by (dt)
tblproperties (
type = 'cow',
primaryKey = 'id',
preCombineField = 'ts',
hoodie.datasource.write.drop.partition.columns = 'true'
);


insert into bq_demo_partitioned_cow partition(dt='2021-12-25') values(1, 'a1', 
10, current_timestamp());
insert into bq_demo_partitioned_cow partition(dt='2021-12-25') values(2, 'a2', 
20, current_timestamp()); {code}
Error:
{code:java}
22/03/29 20:58:02 INFO spark.SparkContext: Starting job: collect at 
HoodieSparkEngineContext.java:100
22/03/29 20:58:02 INFO scheduler.DAGScheduler: Got job 63 (collect at 
HoodieSparkEngineContext.java:100) with 1 output partitions
22/03/29 20:58:02 INFO scheduler.DAGScheduler: Final stage: ResultStage 131 
(collect at HoodieSparkEngineContext.java:100)
22/03/29 20:58:02 INFO scheduler.DAGScheduler: Parents of final stage: List()
22/03/29 20:58:02 INFO scheduler.DAGScheduler: Missing parents: List()
22/03/29 20:58:02 INFO scheduler.DAGScheduler: Submitting ResultStage 131 
(MapPartitionsRDD[235] at map at HoodieSparkEngineContext.java:100), which has 
no missing parents
22/03/29 20:58:02 INFO memory.MemoryStore: Block broadcast_89 stored as values 
in memory (estimated size 71.9 KB, free 364.0 MB)
22/03/29 20:58:02 INFO memory.MemoryStore: Block broadcast_89_piece0 stored as 
bytes in memory (estimated size 26.3 KB, free 364.0 MB)
22/03/29 20:58:02 INFO storage.BlockManagerInfo: Added broadcast_89_piece0 in 
memory on adhoc-1:38703 (size: 26.3 KB, free: 365.7 MB)
22/03/29 20:58:02 INFO spark.SparkContext: Created broadcast 89 from broadcast 
at DAGScheduler.scala:1161
22/03/29 20:58:02 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from 
ResultStage 131 (MapPartitionsRDD[235] at map at 
HoodieSparkEngineContext.java:100) (first 15 tasks are for partitions Vector(0))
22/03/29 20:58:02 INFO scheduler.TaskSchedulerImpl: Adding task set 131.0 with 
1 tasks
22/03/29 20:58:02 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 
131.0 (TID 4081, localhost, executor driver, partition 0, PROCESS_LOCAL, 7803 
bytes)
22/03/29 20:58:02 INFO executor.Executor: Running task 0.0 in stage 131.0 (TID 
4081)
22/03/29 20:58:02 INFO executor.Executor: Finished task 0.0 in stage 131.0 (TID 
4081). 1167 bytes result sent to driver
22/03/29 20:58:02 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 
131.0 (TID 4081) in 17 ms on localhost (executor driver) (1/1)
22/03/29 20:58:02 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 131.0, 
whose tasks have all completed, from pool
22/03/29 20:58:02 INFO scheduler.DAGScheduler: ResultStage 131 (collect at 
HoodieSparkEngineContext.java:100) finished in 0.030 s
22/03/29 20:58:02 INFO scheduler.DAGScheduler: Job 63 finished: collect at 
HoodieSparkEngineContext.java:100, took 0.032364 s
22/03/29 20:58:02 INFO timeline.HoodieActiveTimeline: Loaded instants upto : 
Option{val=[20220329205734338__commit__COMPLETED]}
22/03/29 20:58:02 INFO view.AbstractTableFileSystemView: Took 0 ms to read  0 
instants, 0 replaced file groups
22/03/29 20:58:02 INFO util.ClusteringUtils: Found 0 files in pending 
clustering operations
22/03/29 20:58:02 INFO view.AbstractTableFileSystemView: addFilesToView: 
NumFiles=1, NumFileGroups=1, FileGroupsCreationTime=0, StoreTimeTaken=0
22/03/29 20:58:02 INFO hudi.HoodieFileIndex: Refresh table 
bq_demo_partitioned_cow, spend: 124 ms
22/03/29 20:58:02 INFO table.HoodieTableMetaClient: Loading 
HoodieTableMetaClient from 
hdfs://namenode:8020/user/hive/warehouse/bq_demo_partitioned_cow
22/03/29 20:58:02 INFO table.HoodieTableConfig: Loading table properties from 
hdfs://namenode:8020/user/hive/warehouse/bq_demo_partitioned_cow/.hoodie/hoodie.properties
22/03/29 20:58:02 INFO table.HoodieTableMetaClient: Finished Loading Table of 
type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from 
hdfs://namenode:8020/user/hive/warehouse/bq_demo_partitioned_cow
22/03/29 20:58:02 INFO timeline.HoodieActiveTimeline: 

[jira] [Updated] (HUDI-3020) Create a utility to generate manifest file

2022-03-28 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan updated HUDI-3020:
--
Status: Patch Available  (was: In Progress)

> Create a utility to generate manifest file
> --
>
> Key: HUDI-3020
> URL: https://issues.apache.org/jira/browse/HUDI-3020
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Vinoth Govindarajan
>Assignee: Joyan Sil
>Priority: Major
>  Labels: pull-request-available
>
> Create a utility to generate a manifest file which contains the latest snapshot 
> files for each partition, in CSV format with only one column: filename.
>  
> This is the first step towards integrating Hudi with Snowflake.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3020) Create a utility to generate manifest file

2022-03-28 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan updated HUDI-3020:
--
Epic Link: HUDI-2438  (was: HUDI-2832)

> Create a utility to generate manifest file
> --
>
> Key: HUDI-3020
> URL: https://issues.apache.org/jira/browse/HUDI-3020
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Vinoth Govindarajan
>Assignee: Joyan Sil
>Priority: Major
>  Labels: pull-request-available
>
> Create a utility to generate a manifest file which contains the latest snapshot 
> files for each partition, in CSV format with only one column: filename.
>  
> This is the first step towards integrating Hudi with Snowflake.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3357) Implement the BigQuerySyncTool

2022-03-25 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan updated HUDI-3357:
--
Status: Patch Available  (was: In Progress)

> Implement the BigQuerySyncTool
> --
>
> Key: HUDI-3357
> URL: https://issues.apache.org/jira/browse/HUDI-3357
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: meta-sync, reader-core
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> Implement the BigQuerySyncTool, similar to HiveSyncTool, which syncs the Hudi 
> table into a BigQuery table.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HUDI-3357) Implement the BigQuerySyncTool

2022-03-24 Thread Vinoth Govindarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512104#comment-17512104
 ] 

Vinoth Govindarajan commented on HUDI-3357:
---

[~xushiyan] - Thanks for the reminder and for being flexible in accommodating this 
work as part of the 0.11.0 release; here is the MVP version of the PR: 
[https://github.com/apache/hudi/pull/5125]

 

I will add more unit tests in a few days.

> Implement the BigQuerySyncTool
> --
>
> Key: HUDI-3357
> URL: https://issues.apache.org/jira/browse/HUDI-3357
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: meta-sync, reader-core
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Implement the BigQuerySyncTool, similar to HiveSyncTool, which syncs the Hudi 
> table into a BigQuery table.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (HUDI-3020) Create a utility to generate manifest file

2022-03-08 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan reassigned HUDI-3020:
-

Assignee: Prashant Wason  (was: Vinoth Govindarajan)

> Create a utility to generate manifest file
> --
>
> Key: HUDI-3020
> URL: https://issues.apache.org/jira/browse/HUDI-3020
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Vinoth Govindarajan
>Assignee: Prashant Wason
>Priority: Major
>
> Create a utility to generate a manifest file which contains the latest snapshot 
> files for each partition, in CSV format with only one column: filename.
>  
> This is the first step towards integrating Hudi with Snowflake.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HUDI-3570) [Umbrella] [RFC-TBD] Integrate Hudi with Dremio

2022-03-06 Thread Vinoth Govindarajan (Jira)
Vinoth Govindarajan created HUDI-3570:
-

 Summary: [Umbrella] [RFC-TBD] Integrate Hudi with Dremio
 Key: HUDI-3570
 URL: https://issues.apache.org/jira/browse/HUDI-3570
 Project: Apache Hudi
  Issue Type: Epic
  Components: connectors
Reporter: Vinoth Govindarajan
Assignee: Vinoth Govindarajan


In many large enterprises, especially banks, there is a strong demand for BI. 
Leaders need to look at BI charts every day. As data gradually grows, BI 
queries become slow, which drives the need for query acceleration; pre-accelerated 
data is required to meet these requirements. Dremio can meet this need by analyzing 
historical SQL and automatically building accelerated data. Hudi has a lot of features that 
Iceberg doesn't, so we tend to use Hudi, but Dremio does not currently support 
Hudi.

 

Support Issue in GH: https://github.com/apache/hudi/issues/4627



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (HUDI-3371) Add Dremio integration support to hudi

2022-02-07 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan reassigned HUDI-3371:
-

Assignee: Vinoth Govindarajan

> Add Dremio integration support to hudi
> --
>
> Key: HUDI-3371
> URL: https://issues.apache.org/jira/browse/HUDI-3371
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: sivabalan narayanan
>Assignee: Vinoth Govindarajan
>Priority: Major
>
> For BI reports and dashboards, Dremio provides accelerated queries, so adding 
> Dremio integration with Hudi will be an added value to Hudi users in general. 
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HUDI-3357) Implement the BigQuerySyncTool

2022-01-31 Thread Vinoth Govindarajan (Jira)
Vinoth Govindarajan created HUDI-3357:
-

 Summary: Implement the BigQuerySyncTool
 Key: HUDI-3357
 URL: https://issues.apache.org/jira/browse/HUDI-3357
 Project: Apache Hudi
  Issue Type: New Feature
  Components: hive-sync
Reporter: Vinoth Govindarajan


Implement the BigQuerySyncTool, similar to HiveSyncTool, which syncs the Hudi 
table into a BigQuery table.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (HUDI-3357) Implement the BigQuerySyncTool

2022-01-31 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan reassigned HUDI-3357:
-

Assignee: Vinoth Govindarajan

> Implement the BigQuerySyncTool
> --
>
> Key: HUDI-3357
> URL: https://issues.apache.org/jira/browse/HUDI-3357
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: hive-sync
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Major
>
> Implement the BigQuerySyncTool, similar to HiveSyncTool, which syncs the Hudi 
> table into a BigQuery table.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (HUDI-3290) Make the .hoodie-partition-metadata as empty parquet file

2022-01-26 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan reassigned HUDI-3290:
-

Assignee: Prashant Wason  (was: Vinoth Govindarajan)

> Make the .hoodie-partition-metadata as empty parquet file
> -
>
> Key: HUDI-3290
> URL: https://issues.apache.org/jira/browse/HUDI-3290
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: metadata
>Reporter: Vinoth Govindarajan
>Assignee: Prashant Wason
>Priority: Major
>
> For BigQuery and Snowflake integration, we are not able to create external 
> tables when the partition folder contains the non-parquet file 
> `.hoodie-partition-metadata`.
> I understand this is an important file for finding the .hoodie folder from within 
> the partition folder, and the long-term solution is to get rid of this file. But 
> as a short-term solution, if we can convert this to an empty parquet file and 
> add the necessary depth information in the footer, then it will pass the 
> BigQuery/Snowflake external table validation and allow us to create an 
> external parquet table on top of the hudi folder structure.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HUDI-3290) Make the .hoodie-partition-metadata as empty parquet file

2022-01-20 Thread Vinoth Govindarajan (Jira)
Vinoth Govindarajan created HUDI-3290:
-

 Summary: Make the .hoodie-partition-metadata as empty parquet file
 Key: HUDI-3290
 URL: https://issues.apache.org/jira/browse/HUDI-3290
 Project: Apache Hudi
  Issue Type: New Feature
  Components: metadata
Reporter: Vinoth Govindarajan
Assignee: Vinoth Govindarajan


For BigQuery and Snowflake integration, we are unable to create external tables 
when the partition folder contains the non-parquet file `.hoodie-partition-metadata`.

I understand this is an important file for locating the .hoodie folder from within 
the partition folder. The long-term solution is to get rid of this file, but as 
a short-term solution, if we convert it to an empty parquet file and add the 
necessary depth information in the footer, it will pass the BigQuery/Snowflake 
external table validation and allow us to create an external parquet table on 
top of the hudi folder structure.
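
As a rough illustration of the short-term idea, a minimal sketch (not the actual 
implementation) that writes a zero-row parquet file carrying a partition-depth 
hint in the footer key/value metadata; the column name and metadata key below are 
hypothetical placeholders:
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# One placeholder column, zero rows; the depth hint travels in the schema
# metadata, which pyarrow writes into the parquet footer key/value metadata.
schema = pa.schema(
    [pa.field("placeholder", pa.string())],
    metadata={"hoodie_partition_depth": "1"},  # hypothetical key name
)
empty = pa.Table.from_arrays([pa.array([], type=pa.string())], schema=schema)

# Keep the existing file name so readers that look for it still find it.
pq.write_table(empty, ".hoodie-partition-metadata")
{code}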



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Comment Edited] (HUDI-2438) [Umbrella] [RFC-34] Implement BigQuerySyncTool for BigQuery Sync

2021-12-27 Thread Vinoth Govindarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17465859#comment-17465859
 ] 

Vinoth Govindarajan edited comment on HUDI-2438 at 12/27/21, 8:01 PM:
--

Hi [~qiao.xu] - Good news!! I found a way to integrate Hudi with BigQuery; this 
is the general idea:

 
 * Let's say you have a Hudi table data on google cloud storage (GCS).

{code:java}
create table dwh.bq_demo_partitioned_cow (
  id bigint,
  name string,
  price double,
  ts bigint,
  dt string
) using hudi
partitioned by (dt)
options (
type = 'cow',
primaryKey = 'id',
preCombineField = 'ts',
hoodie.datasource.write.drop.partition.columns = 'true'
)
location 'gs://hudi_datasets/bq_demo_partitioned_cow/';{code}
BQ doesn't accept the partition column in the parquet schema, hence we need to 
drop the partition columns from the schema by enabling this flag: 
{code:java}
hoodie.datasource.write.drop.partition.columns = 'true'{code}
 * Generate a manifest file for the Hudi table that lists the latest snapshot 
parquet file names in CSV format, with a single column: the file name. The 
manifest file should live in the .hoodie metadata folder 
(`gs://bucket_name/table_name/.hoodie/manifest/latest_snapshot_files.csv`)

{code:java}
// this command is coming soon.
GENERATE symlink_format_manifest FOR TABLE dwh.bq_demo_partitioned_cow;{code}
 * Create a BQ table named `table_name_manifest` with a single column, filename, 
at this location 
`gs://bucket_name/table_name/.hoodie/manifest/latest_snapshot_files.csv`.

{code:java}
CREATE EXTERNAL TABLE `golden-union-336019.dwh.bq_demo_partitioned_cow_manifest`
(
  filename STRING
)
OPTIONS(
  format="CSV",
  
uris=["gs://hudi_datasets/bq_demo_partitioned_cow/.hoodie/manifest/latest_snapshot_files.csv"]
);{code}
 * Create another BQ table named `table_name_history` at this location 
`gs://bucket_name/table_name`. Don't use this table to query the data; it will 
return duplicate records since it scans all the versions of the parquet files 
in the table/partition folders.

{code:java}
CREATE EXTERNAL TABLE `golden-union-336019.dwh.bq_demo_partitioned_cow_history`
WITH PARTITION COLUMNS
OPTIONS(
  ignore_unknown_values=true,
  format="PARQUET",
  hive_partition_uri_prefix="gs://hudi_datasets/bq_demo_partitioned_cow/",
  uris=["gs://hudi_snowflake/bq_demo_partitioned_cow/dt=*"]
);{code}
 * Create a BQ view named `table_name` with this query: 

{code:java}
CREATE VIEW `golden-union-336019.dwh.bq_demo_partitioned_cow`
AS SELECT
  *
FROM
  `golden-union-336019.dwh.bq_demo_partitioned_cow_history`
WHERE
  _hoodie_file_name IN (
  SELECT
filename
  FROM
`golden-union-336019.dwh.bq_demo_partitioned_cow_manifest`);{code}
 * The view you created last has the data from the Hudi table without any 
duplicates; use that view to query the data.

 

To make this model work, we need a way to generate the manifest file with the 
latest snapshot files; I will create a PR for that soon (see the sketch below). 
One final step: Hudi generates multiple non-parquet files in the table/partition 
location, like (crc + .hoodie_partition_metadata):
{code:java}
..hoodie_partition_metadata.crc
.a4949325-81ca-4d6b-8ce5-5b41bea62b36-0_0-21-1605_20211227090948.parquet.crc
.a4949325-81ca-4d6b-8ce5-5b41bea62b36-0_0-53-3220_20211227091157.parquet.crc
.hoodie_partition_metadata
a4949325-81ca-4d6b-8ce5-5b41bea62b36-0_0-21-1605_20211227090948.parquet
a4949325-81ca-4d6b-8ce5-5b41bea62b36-0_0-53-3220_20211227091157.parquet{code}
We need to remove these non-parquet files to make the BQ integration work; I 
will talk to the Hudi core team about how to get rid of these files from the 
partition location.
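
A minimal PySpark sketch of the kind of manifest generation described above — 
not the upcoming PR itself. The base path is taken from the example table; the 
final rename of the part file to `latest_snapshot_files.csv` is assumed and left 
out:
{code:python}
from pyspark.sql import SparkSession

# Requires the hudi-spark bundle on the classpath.
spark = SparkSession.builder.appName("hudi-latest-snapshot-manifest").getOrCreate()

base_path = "gs://hudi_datasets/bq_demo_partitioned_cow"

# A snapshot read of a Hudi table exposes only the latest file slice per file
# group, so the distinct _hoodie_file_name values are the latest snapshot files.
latest_files = (
    spark.read.format("hudi").load(base_path)
    .select("_hoodie_file_name")
    .distinct()
)

# Write a single-column CSV under the .hoodie metadata folder. Spark emits
# part-*.csv files; renaming the part file to latest_snapshot_files.csv is a
# follow-up step not shown here.
(
    latest_files.coalesce(1)
    .write.mode("overwrite")
    .option("header", "false")
    .csv(base_path + "/.hoodie/manifest")
)
{code}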

 


was (Author: vino):
Hi [~qiao.xu] - Good news!! I found a way to integrate Hudi with BigQuery, this 
is a general idea:

 
 * Let's say you have a Hudi table data on google storage (GCS).

{code:java}
create table dwh.bq_demo_partitioned_cow (
  id bigint,
  name string,
  price double,
  ts bigint,
  dt string
) using hudi
partitioned by (dt)
options (
type = 'cow',
primaryKey = 'id',
preCombineField = 'ts',
hoodie.datasource.write.drop.partition.columns = 'true'
)
location 'gs://hudi_datasets/bq_demo_partitioned_cow/';{code}
BQ doesn't accept the partition column in the parquet schema, hence we need to 
drop the partition columns from the schema by enabling this 

[jira] [Comment Edited] (HUDI-2438) [Umbrella] [RFC-34] Implement BigQuerySyncTool for BigQuery Sync

2021-12-27 Thread Vinoth Govindarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17465859#comment-17465859
 ] 

Vinoth Govindarajan edited comment on HUDI-2438 at 12/27/21, 8:00 PM:
--

Hi [~qiao.xu] - Good news!! I found a way to integrate Hudi with BigQuery, this 
is a general idea:

 
 * Let's say you have a Hudi table data on google storage (GCS).

{code:java}
create table dwh.bq_demo_partitioned_cow (
  id bigint,
  name string,
  price double,
  ts bigint,
  dt string
) using hudi
partitioned by (dt)
options (
type = 'cow',
primaryKey = 'id',
preCombineField = 'ts',
hoodie.datasource.write.drop.partition.columns = 'true'
)
location 'gs://hudi_datasets/bq_demo_partitioned_cow/';{code}
BQ doesn't accept the partition column in the parquet schema, hence we need to 
drop the partition columns from the schema by enabling this flag: 
{code:java}
hoodie.datasource.write.drop.partition.columns = 'true'{code}
 * Generate a manifest file for the Hudi table which has the list of the latest 
snapshot parquet file names in a CSV format with only one column the file name. 
The location of the manifest file should be on the .hoodie metadata folder 
(`gs://bucket_name/table_name/.hoodie/manifest/latest_snapshot_files.csv`)

{code:java}
// this command is coming soon.
GENERATE symlink_format_manifest FOR TABLE dwh.bq_demo_partitioned_cow;{code}
 * Create a BQ table named `table_name_manifest` with only one column filename 
with this location 
`gs://bucket_name/table_name/.hoodie/manifest/latest_snapshot_files.csv`.

{code:java}
CREATE EXTERNAL TABLE `golden-union-336019.dwh.bq_demo_partitioned_cow_manifest`
(
  filename STRING
)
OPTIONS(
  format="CSV",
  
uris=["gs://hudi_datasets/bq_demo_partitioned_cow/.hoodie/manifest/latest_snapshot_files.csv"]
);{code}
 * Create another BQ table named `table_name_history` with this location 
`gs://bucket_name/table_name`, don't use this table to query the data, this 
table will have duplicate records since it scans all the versions of parquet 
files in the table/partition folders.

{code:java}
CREATE EXTERNAL TABLE `golden-union-336019.dwh.bq_demo_partitioned_cow_history`
WITH PARTITION COLUMNS
OPTIONS(
  ignore_unknown_values=true,
  format="PARQUET",
  hive_partition_uri_prefix="gs://hudi_datasets/bq_demo_partitioned_cow/",
  uris=["gs://hudi_snowflake/bq_demo_partitioned_cow/dt=*"]
);{code}
 * Create a BQ view named `table_name` with this query: 

{code:java}
CREATE VIEW `golden-union-336019.dwh.bq_demo_partitioned_cow`
AS SELECT
  *
FROM
  `golden-union-336019.dwh.bq_demo_partitioned_cow_history`
WHERE
  _hoodie_file_name IN (
  SELECT
filename
  FROM
`golden-union-336019.dwh.bq_demo_partitioned_cow_manifest`);{code}
 * The last view you created has the data from the Hudi table without any 
duplicates, you can use that table to query the data.

 

To make this model work, we need a way to generate the manifest file with the 
latest snapshot files, I will create a PR for that soon. One more final step, 
Hudi generates multiple non-parquet files in the table/partition location like 
(crc + .hoodie_partition_metadata):
{code:java}
..hoodie_partition_metadata.crc
.a4949325-81ca-4d6b-8ce5-5b41bea62b36-0_0-21-1605_20211227090948.parquet.crc
.a4949325-81ca-4d6b-8ce5-5b41bea62b36-0_0-53-3220_20211227091157.parquet.crc
.hoodie_partition_metadata
a4949325-81ca-4d6b-8ce5-5b41bea62b36-0_0-21-1605_20211227090948.parquet
a4949325-81ca-4d6b-8ce5-5b41bea62b36-0_0-53-3220_20211227091157.parquet{code}
We need to remove these non-parquet files to make the BQ integration work, I 
will talk to the Hudi core team on how to get rid of these files from the 
partition location.

 


was (Author: vino):
Hi [~qiao.xu] - Good news!! I found a way to integrate Hudi with BigQuery, this 
is a general idea:

 
 # Let's say you have a Hudi table data on google storage (GCS).

{code:java}
create table dwh.bq_demo_partitioned_cow (
  id bigint,
  name string,
  price double,
  ts bigint,
  dt string
) using hudi
partitioned by (dt)
options (
type = 'cow',
primaryKey = 'id',
preCombineField = 'ts',
hoodie.datasource.write.drop.partition.columns = 'true'
)
location 'gs://hudi_datasets/bq_demo_partitioned_cow/';{code}
BQ doesn't accept the partition column in the parquet schema, hence we need to 
drop the partition columns from the schema by enabling this flag: 


[jira] [Commented] (HUDI-2438) [Umbrella] [RFC-34] Implement BigQuerySyncTool for BigQuery Sync

2021-12-27 Thread Vinoth Govindarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17465859#comment-17465859
 ] 

Vinoth Govindarajan commented on HUDI-2438:
---

Hi [~qiao.xu] - Good news!! I found a way to integrate Hudi with BigQuery, this 
is a general idea:

 
 # Let's say you have a Hudi table data on google storage (GCS).

{code:java}
create table dwh.bq_demo_partitioned_cow (
  id bigint,
  name string,
  price double,
  ts bigint,
  dt string
) using hudi
partitioned by (dt)
options (
type = 'cow',
primaryKey = 'id',
preCombineField = 'ts',
hoodie.datasource.write.drop.partition.columns = 'true'
)
location 'gs://hudi_datasets/bq_demo_partitioned_cow/';{code}
BQ doesn't accept the partition column in the parquet schema, hence we need to 
drop the partition columns from the schema by enabling this flag: 

{code:java}
hoodie.datasource.write.drop.partition.columns = 'true'{code}

 # Generate a manifest file for the Hudi table which has the list of the latest 
snapshot parquet file names in a CSV format with only one column the file name. 
The location of the manifest file should be on the .hoodie metadata folder 
(`gs://bucket_name/table_name/.hoodie/manifest/latest_snapshot_files.csv`)

{code:java}
// this command is coming soon.
GENERATE symlink_format_manifest FOR TABLE dwh.bq_demo_partitioned_cow;{code}

 # Create a BQ table named `table_name_manifest` with only one column filename 
with this location 
`gs://bucket_name/table_name/.hoodie/manifest/latest_snapshot_files.csv`.

{code:java}
CREATE EXTERNAL TABLE `golden-union-336019.dwh.bq_demo_partitioned_cow_manifest`
(
  filename STRING
)
OPTIONS(
  format="CSV",
  
uris=["gs://hudi_datasets/bq_demo_partitioned_cow/.hoodie/manifest/latest_snapshot_files.csv"]
);{code}

 # Create another BQ table named `table_name_history` with this location 
`gs://bucket_name/table_name`, don't use this table to query the data, this 
table will have duplicate records since it scans all the versions of parquet 
files in the table/partition folders.

{code:java}
CREATE EXTERNAL TABLE `golden-union-336019.dwh.bq_demo_partitioned_cow_history`
WITH PARTITION COLUMNS
OPTIONS(
  ignore_unknown_values=true,
  format="PARQUET",
  hive_partition_uri_prefix="gs://hudi_datasets/bq_demo_partitioned_cow/",
  uris=["gs://hudi_snowflake/bq_demo_partitioned_cow/dt=*"]
);{code}

 # Create a BQ view named `table_name` with this query: 

{code:java}
CREATE VIEW `golden-union-336019.dwh.bq_demo_partitioned_cow`
AS SELECT
  *
FROM
  `golden-union-336019.dwh.bq_demo_partitioned_cow_history`
WHERE
  _hoodie_file_name IN (
  SELECT
filename
  FROM
`golden-union-336019.dwh.bq_demo_partitioned_cow_manifest`);{code}

 # The last view you created has the data from the Hudi table without any 
duplicates, you can use that table to query the data.

 

To make this model work, we need a way to generate the manifest file with the 
latest snapshot files, I will create a PR for that soon. One more final step, 
Hudi generates multiple non-parquet files in the table/partition location like 
(crc + .hoodie_partition_metadata):
{code:java}
..hoodie_partition_metadata.crc
.a4949325-81ca-4d6b-8ce5-5b41bea62b36-0_0-21-1605_20211227090948.parquet.crc
.a4949325-81ca-4d6b-8ce5-5b41bea62b36-0_0-53-3220_20211227091157.parquet.crc
.hoodie_partition_metadata
a4949325-81ca-4d6b-8ce5-5b41bea62b36-0_0-21-1605_20211227090948.parquet
a4949325-81ca-4d6b-8ce5-5b41bea62b36-0_0-53-3220_20211227091157.parquet{code}
We need to remove these non-parquet files to make the BQ integration work, I 
will talk to the Hudi core team on how to get rid of these files from the 
partition location.

 

> [Umbrella] [RFC-34] Implement BigQuerySyncTool for BigQuery Sync
> 
>
> Key: HUDI-2438
> URL: https://issues.apache.org/jira/browse/HUDI-2438
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Common Core
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Major
>  Labels: BigQuery, Integration
> Fix For: 0.11.0
>
>
> BigQuery is Google Cloud's fully managed, petabyte-scale, and cost-effective 
> analytics data warehouse that lets you run analytics over vast amounts of 
> data in near real-time. BigQuery currently [doesn’t 
> support|https://cloud.google.com/bigquery/external-data-cloud-storage] Apache 
> Hudi file format, but it has support for the Parquet file format. The 
> proposal is to implement a BigQuerySync similar to HiveSync to sync the Hudi 
> table as 

[jira] [Created] (HUDI-3091) Make simple index as the default hoodie.index.type

2021-12-21 Thread Vinoth Govindarajan (Jira)
Vinoth Govindarajan created HUDI-3091:
-

 Summary: Make simple index as the default hoodie.index.type
 Key: HUDI-3091
 URL: https://issues.apache.org/jira/browse/HUDI-3091
 Project: Apache Hudi
  Issue Type: New Feature
  Components: Index
Reporter: Vinoth Govindarajan


When performing upserts with derived datasets, we often run into an OOM issue 
with the bloom index, so we changed the index type of all of those datasets to 
simple to resolve the issue.

 

Some of the tables were non-partitioned tables, for which the bloom index is not 
the right choice.

I'm proposing to make the simple index the default value; on a case-by-case 
basis, folks can still choose the bloom index for the additional performance 
gains offered by bloom filters.

 

I agree that the performance will not be optimal, but for regular use cases the 
simple index would give sub-optimal read/write performance while not breaking 
any ingestion/derived jobs.
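
For reference, a hedged sketch of how a writer opts into the simple index today 
(table name, fields, and path are illustrative; BLOOM is the current default 
being discussed):
{code:python}
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("hudi-simple-index-write")
    # Hudi writes typically require the Kryo serializer.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Stand-in for the incoming batch of records to upsert.
df = spark.createDataFrame([(1, "a", 1640000000)], ["id", "name", "ts"])

(
    df.write.format("hudi")
    .option("hoodie.table.name", "events")
    .option("hoodie.datasource.write.recordkey.field", "id")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.index.type", "SIMPLE")  # default is BLOOM; this proposal would flip it
    .mode("append")
    .save("gs://bucket/events")
)
{code}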

 

 

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HUDI-3020) Create a utility to generate manifest file

2021-12-14 Thread Vinoth Govindarajan (Jira)
Vinoth Govindarajan created HUDI-3020:
-

 Summary: Create a utility to generate manifest file
 Key: HUDI-3020
 URL: https://issues.apache.org/jira/browse/HUDI-3020
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: Vinoth Govindarajan
Assignee: Vinoth Govindarajan


Create a utility to generate a manifest file that contains the latest snapshot 
files for each partition, in CSV format with a single column: filename.

 

This is the first step towards integrating hudi with snowflake.
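
A minimal PySpark sketch of what such a utility could look like, assuming the 
standard Hudi meta columns; the table path and output layout are placeholders, 
not the actual utility:
{code:python}
from pyspark.sql import SparkSession

# Requires the hudi-spark bundle on the classpath.
spark = SparkSession.builder.appName("hudi-per-partition-manifest").getOrCreate()

base_path = "gs://bucket_name/table_name"  # hypothetical table location

# Snapshot reads expose only the latest file slice per file group, so the
# distinct meta-column pairs are the latest snapshot files per partition.
latest_files = (
    spark.read.format("hudi").load(base_path)
    .select("_hoodie_partition_path", "_hoodie_file_name")
    .distinct()
)

# One folder per partition, each holding a single-column CSV of file names.
(
    latest_files.write.mode("overwrite")
    .partitionBy("_hoodie_partition_path")
    .option("header", "false")
    .csv(base_path + "/.hoodie/manifest")
)
{code}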



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2832) [Umbrella] [RFC-40] Implement SnowflakeSyncTool to support Hudi to Snowflake Integration

2021-12-14 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan updated HUDI-2832:
--
Status: In Progress  (was: Open)

> [Umbrella] [RFC-40] Implement SnowflakeSyncTool to support Hudi to Snowflake 
> Integration
> 
>
> Key: HUDI-2832
> URL: https://issues.apache.org/jira/browse/HUDI-2832
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Common Core
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Major
>  Labels: BigQuery, Integration, pull-request-available
> Fix For: 0.11.0
>
>
> Snowflake is a fully managed service that’s simple to use but can power a 
> near-unlimited number of concurrent workloads. Snowflake is a solution for 
> data warehousing, data lakes, data engineering, data science, data 
> application development, and securely sharing and consuming shared data. 
> Snowflake [doesn’t 
> support|https://docs.snowflake.com/en/sql-reference/sql/alter-file-format.html]
>  Apache Hudi file format yet, but it has support for Parquet, ORC, and Delta 
> file format. This proposal is to implement a SnowflakeSync similar to 
> HiveSync to sync the Hudi table as the Snowflake External Parquet table so 
> that users can query the Hudi tables using Snowflake. Many users have 
> expressed interest in Hudi and other support channels asking to integrate 
> Hudi with Snowflake, this will unlock new use cases for Hudi.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2319) Integrate hudi with dbt (data build tool)

2021-12-07 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan updated HUDI-2319:
--
Fix Version/s: 0.10.0
   (was: 0.11.0)

> Integrate hudi with dbt (data build tool)
> -
>
> Key: HUDI-2319
> URL: https://issues.apache.org/jira/browse/HUDI-2319
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Usability
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Critical
>  Labels: integration
> Fix For: 0.10.0
>
>
> dbt (data build tool) enables analytics engineers to transform data in their 
> warehouses by simply writing select statements. dbt handles turning these 
> select statements into tables and views.
> dbt does the {{T}} in {{ELT}} (Extract, Load, Transform) processes – it 
> doesn’t extract or load data, but it’s extremely good at transforming data 
> that’s already loaded into your warehouse.
>  
> dbt currently supports only the delta file format; there are a few folks in the 
> dbt community asking for hudi integration. Moreover, adding dbt integration will 
> make it easier for any data engineer/analyst to create derived datasets.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2319) Integrate hudi with dbt (data build tool)

2021-12-07 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan updated HUDI-2319:
--
Status: Resolved  (was: Patch Available)

> Integrate hudi with dbt (data build tool)
> -
>
> Key: HUDI-2319
> URL: https://issues.apache.org/jira/browse/HUDI-2319
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Usability
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Critical
>  Labels: integration
> Fix For: 0.10.0
>
>
> dbt (data build tool) enables analytics engineers to transform data in their 
> warehouses by simply writing select statements. dbt handles turning these 
> select statements into tables and views.
> dbt does the {{T}} in {{ELT}} (Extract, Load, Transform) processes – it 
> doesn’t extract or load data, but it’s extremely good at transforming data 
> that’s already loaded into your warehouse.
>  
> dbt currently supports only the delta file format; there are a few folks in the 
> dbt community asking for hudi integration. Moreover, adding dbt integration will 
> make it easier for any data engineer/analyst to create derived datasets.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HUDI-2319) Integrate hudi with dbt (data build tool)

2021-12-07 Thread Vinoth Govindarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17454963#comment-17454963
 ] 

Vinoth Govindarajan commented on HUDI-2319:
---

Hudi support has been added to dbt; here is the dbt-spark 
[PR|https://github.com/dbt-labs/dbt-spark/pull/210].

> Integrate hudi with dbt (data build tool)
> -
>
> Key: HUDI-2319
> URL: https://issues.apache.org/jira/browse/HUDI-2319
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Usability
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Critical
>  Labels: integration
> Fix For: 0.11.0
>
>
> dbt (data build tool) enables analytics engineers to transform data in their 
> warehouses by simply writing select statements. dbt handles turning these 
> select statements into tables and views.
> dbt does the {{T}} in {{ELT}} (Extract, Load, Transform) processes – it 
> doesn’t extract or load data, but it’s extremely good at transforming data 
> that’s already loaded into your warehouse.
>  
> dbt currently supports only the delta file format; there are a few folks in the 
> dbt community asking for hudi integration. Moreover, adding dbt integration will 
> make it easier for any data engineer/analyst to create derived datasets.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2832) [Umbrella] [RFC-40] Implement SnowflakeSyncTool for Hudi to Snowflake Integration

2021-11-22 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan updated HUDI-2832:
--
Description: Snowflake is a fully managed service that’s simple to use but 
can power a near-unlimited number of concurrent workloads. Snowflake is a 
solution for data warehousing, data lakes, data engineering, data science, data 
application development, and securely sharing and consuming shared data. 
Snowflake [doesn’t 
support|https://docs.snowflake.com/en/sql-reference/sql/alter-file-format.html] 
Apache Hudi file format yet, but it has support for Parquet, ORC, and Delta 
file format. This proposal is to implement a SnowflakeSync similar to HiveSync 
to sync the Hudi table as the Snowflake External Parquet table so that users 
can query the Hudi tables using Snowflake. Many users have expressed interest 
in Hudi and other support channels asking to integrate Hudi with Snowflake, 
this will unlock new use cases for Hudi.  (was: BigQuery is Google Cloud's 
fully managed, petabyte-scale, and cost-effective analytics data warehouse that 
lets you run analytics over vast amounts of data in near real-time. BigQuery 
currently [doesn’t 
support|https://cloud.google.com/bigquery/external-data-cloud-storage] Apache 
Hudi file format, but it has support for the Parquet file format. The proposal 
is to implement a BigQuerySync similar to HiveSync to sync the Hudi table as 
the BigQuery External Parquet table so that users can query the Hudi tables 
using BigQuery. Uber is already syncing some of its Hudi tables to BigQuery 
data mart this will help them to write, sync, and query.

 

More details are in RFC-34: 
[https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=188745980])

> [Umbrella] [RFC-40] Implement SnowflakeSyncTool for Hudi to Snowflake 
> Integration
> -
>
> Key: HUDI-2832
> URL: https://issues.apache.org/jira/browse/HUDI-2832
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Common Core
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Major
>  Labels: BigQuery, Integration
> Fix For: 0.11.0
>
>
> Snowflake is a fully managed service that’s simple to use but can power a 
> near-unlimited number of concurrent workloads. Snowflake is a solution for 
> data warehousing, data lakes, data engineering, data science, data 
> application development, and securely sharing and consuming shared data. 
> Snowflake [doesn’t 
> support|https://docs.snowflake.com/en/sql-reference/sql/alter-file-format.html]
>  Apache Hudi file format yet, but it has support for Parquet, ORC, and Delta 
> file format. This proposal is to implement a SnowflakeSync similar to 
> HiveSync to sync the Hudi table as the Snowflake External Parquet table so 
> that users can query the Hudi tables using Snowflake. Many users have 
> expressed interest in Hudi and other support channels asking to integrate 
> Hudi with Snowflake, this will unlock new use cases for Hudi.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2832) [Umbrella] [RFC-40] Implement SnowflakeSyncTool to support Hudi to Snowflake Integration

2021-11-22 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan updated HUDI-2832:
--
Summary: [Umbrella] [RFC-40] Implement SnowflakeSyncTool to support Hudi to 
Snowflake Integration  (was: [Umbrella] [RFC-40] Implement SnowflakeSyncTool 
for Hudi to Snowflake Integration)

> [Umbrella] [RFC-40] Implement SnowflakeSyncTool to support Hudi to Snowflake 
> Integration
> 
>
> Key: HUDI-2832
> URL: https://issues.apache.org/jira/browse/HUDI-2832
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Common Core
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Major
>  Labels: BigQuery, Integration
> Fix For: 0.11.0
>
>
> Snowflake is a fully managed service that’s simple to use but can power a 
> near-unlimited number of concurrent workloads. Snowflake is a solution for 
> data warehousing, data lakes, data engineering, data science, data 
> application development, and securely sharing and consuming shared data. 
> Snowflake [doesn’t 
> support|https://docs.snowflake.com/en/sql-reference/sql/alter-file-format.html]
>  Apache Hudi file format yet, but it has support for Parquet, ORC, and Delta 
> file format. This proposal is to implement a SnowflakeSync similar to 
> HiveSync to sync the Hudi table as the Snowflake External Parquet table so 
> that users can query the Hudi tables using Snowflake. Many users have 
> expressed interest in Hudi and other support channels asking to integrate 
> Hudi with Snowflake, this will unlock new use cases for Hudi.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HUDI-2832) [Umbrella] [RFC-40] Implement SnowflakeSyncTool for Hudi to Snowflake Integration

2021-11-22 Thread Vinoth Govindarajan (Jira)
Vinoth Govindarajan created HUDI-2832:
-

 Summary: [Umbrella] [RFC-40] Implement SnowflakeSyncTool for Hudi 
to Snowflake Integration
 Key: HUDI-2832
 URL: https://issues.apache.org/jira/browse/HUDI-2832
 Project: Apache Hudi
  Issue Type: New Feature
  Components: Common Core
Reporter: Vinoth Govindarajan
Assignee: Vinoth Govindarajan
 Fix For: 0.11.0


BigQuery is Google Cloud's fully managed, petabyte-scale, and cost-effective 
analytics data warehouse that lets you run analytics over vast amounts of data 
in near real-time. BigQuery currently [doesn’t 
support|https://cloud.google.com/bigquery/external-data-cloud-storage] Apache 
Hudi file format, but it has support for the Parquet file format. The proposal 
is to implement a BigQuerySync similar to HiveSync to sync the Hudi table as 
the BigQuery External Parquet table so that users can query the Hudi tables 
using BigQuery. Uber is already syncing some of its Hudi tables to a BigQuery 
data mart; this will help them to write, sync, and query.

 

More details are in RFC-34: 
[https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=188745980]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HUDI-2438) [Umbrella] [RFC-34] Implement BigQuerySyncTool for BigQuery Sync

2021-11-22 Thread Vinoth Govindarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17447608#comment-17447608
 ] 

Vinoth Govindarajan commented on HUDI-2438:
---

Hi [~qiao.xu]! 

After investigation, I've found that BigQuery doesn't support any kind of 
manifest file to tell it which files to read, which makes this implementation 
very difficult. The only option now is to select the correct set of parquet 
files for each and every commit and sync only those files to BigQuery via GCS, 
which would break snapshot isolation and the other benefits of Hudi 
time travel.

I'm focusing on other initiatives like the dbt and Snowflake integrations; I'll 
look into this after completing both of those.

 

 

 

> [Umbrella] [RFC-34] Implement BigQuerySyncTool for BigQuery Sync
> 
>
> Key: HUDI-2438
> URL: https://issues.apache.org/jira/browse/HUDI-2438
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Common Core
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Major
>  Labels: BigQuery, Integration
> Fix For: 0.11.0
>
>
> BigQuery is Google Cloud's fully managed, petabyte-scale, and cost-effective 
> analytics data warehouse that lets you run analytics over vast amounts of 
> data in near real-time. BigQuery currently [doesn’t 
> support|https://cloud.google.com/bigquery/external-data-cloud-storage] Apache 
> Hudi file format, but it has support for the Parquet file format. The 
> proposal is to implement a BigQuerySync similar to HiveSync to sync the Hudi 
> table as the BigQuery External Parquet table so that users can query the Hudi 
> tables using BigQuery. Uber is already syncing some of its Hudi tables to 
> BigQuery data mart this will help them to write, sync, and query.
>  
> More details are in RFC-34: 
> [https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=188745980]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (HUDI-2681) Make hoodie record_key and preCombine_key optional

2021-11-03 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan reassigned HUDI-2681:
-

Assignee: sivabalan narayanan

> Make hoodie record_key and preCombine_key optional
> --
>
> Key: HUDI-2681
> URL: https://issues.apache.org/jira/browse/HUDI-2681
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Common Core
>Reporter: Vinoth Govindarajan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: 0.10_blocker
>
> At present, Hudi needs a record key and preCombine key to create a Hudi 
> dataset, which puts a restriction on the kinds of datasets we can create 
> using Hudi.
>  
> In order to increase the adoption of the Hudi file format across all kinds of 
> derived datasets, similar to Parquet/ORC, we need to offer flexibility to 
> users. I understand that the record key is used for the upsert primitive and 
> that we need the preCombine key to break ties and deduplicate, but there are 
> event data and other datasets without any primary key (append-only datasets) 
> that can benefit from Hudi, since the Hudi ecosystem offers other features 
> such as snapshot isolation, indexes, clustering, the delta streamer, etc., 
> which could be applied to any dataset without a record key.
>  
> The idea of this proposal is to make both the record key and preCombine key 
> optional to allow a variety of new use cases on top of Hudi.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2681) Make hoodie record_key and preCombine_key optional

2021-11-03 Thread Vinoth Govindarajan (Jira)
Vinoth Govindarajan created HUDI-2681:
-

 Summary: Make hoodie record_key and preCombine_key optional
 Key: HUDI-2681
 URL: https://issues.apache.org/jira/browse/HUDI-2681
 Project: Apache Hudi
  Issue Type: New Feature
  Components: Common Core
Reporter: Vinoth Govindarajan


At present, Hudi needs a record key and preCombine key to create a Hudi dataset, 
which puts a restriction on the kinds of datasets we can create using Hudi.

 

In order to increase the adoption of the Hudi file format across all kinds of 
derived datasets, similar to Parquet/ORC, we need to offer flexibility to users. 
I understand that the record key is used for the upsert primitive and that we 
need the preCombine key to break ties and deduplicate, but there are event data 
and other datasets without any primary key (append-only datasets) that can 
benefit from Hudi, since the Hudi ecosystem offers other features such as 
snapshot isolation, indexes, clustering, the delta streamer, etc., which could 
be applied to any dataset without a record key.

 

The idea of this proposal is to make both the record key and preCombine key 
optional to allow a variety of new use cases on top of Hudi.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2510) QuickStart html page is showing 404

2021-10-05 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan updated HUDI-2510:
--
Status: In Progress  (was: Open)

> QuickStart html page is showing 404
> ---
>
> Key: HUDI-2510
> URL: https://issues.apache.org/jira/browse/HUDI-2510
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Rajesh Mahindra
>Assignee: Vinoth Govindarajan
>Priority: Major
>  Labels: pull-request-available
>
> Some external entities such as GCP are linking to 
> [https://hudi.apache.org/quickstart.html] for quick start. 
>  
> [https://cloud.google.com/blog/products/data-analytics/getting-started-with-new-table-formats-on-dataproc]
>  
> Can we create an alias to the actual quick start link?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2510) QuickStart html page is showing 404

2021-10-05 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan updated HUDI-2510:
--
Status: Patch Available  (was: In Progress)

> QuickStart html page is showing 404
> ---
>
> Key: HUDI-2510
> URL: https://issues.apache.org/jira/browse/HUDI-2510
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Rajesh Mahindra
>Assignee: Vinoth Govindarajan
>Priority: Major
>  Labels: pull-request-available
>
> Some external entities such as GCP are linking to 
> [https://hudi.apache.org/quickstart.html] for quick start. 
>  
> [https://cloud.google.com/blog/products/data-analytics/getting-started-with-new-table-formats-on-dataproc]
>  
> Can we create an alias to the actual quick start link?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-2510) QuickStart html page is showing 404

2021-10-05 Thread Vinoth Govindarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17424790#comment-17424790
 ] 

Vinoth Govindarajan commented on HUDI-2510:
---

[~rmahindra] - I have explored the option to add a redirect in the config file, 
but for some reason it's not working, so I created a new quickstart page and did 
the redirection using JS. This is the PR for the same; please review: 
https://github.com/apache/hudi/pull/3753

> QuickStart html page is showing 404
> ---
>
> Key: HUDI-2510
> URL: https://issues.apache.org/jira/browse/HUDI-2510
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Rajesh Mahindra
>Assignee: Vinoth Govindarajan
>Priority: Major
>
> Some external entities such as GCP are linking to 
> [https://hudi.apache.org/quickstart.html] for quick start. 
>  
> [https://cloud.google.com/blog/products/data-analytics/getting-started-with-new-table-formats-on-dataproc]
>  
> Can we create an alias to the actual quick start link?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2438) [Umbrella] [RFC-34] Implement BigQuerySyncTool for BigQuery Sync

2021-09-15 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan updated HUDI-2438:
--
Description: 
BigQuery is Google Cloud's fully managed, petabyte-scale, and cost-effective 
analytics data warehouse that lets you run analytics over vast amounts of data 
in near real-time. BigQuery currently [doesn’t 
support|https://cloud.google.com/bigquery/external-data-cloud-storage] Apache 
Hudi file format, but it has support for the Parquet file format. The proposal 
is to implement a BigQuerySync similar to HiveSync to sync the Hudi table as 
the BigQuery External Parquet table so that users can query the Hudi tables 
using BigQuery. Uber is already syncing some of its Hudi tables to a BigQuery 
data mart; this will help them to write, sync, and query.

 

More details are in RFC-34: 
[https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=188745980]

  was:
BigQuery is Google Cloud's fully managed, petabyte-scale, and cost-effective 
analytics data warehouse that lets you run analytics over vast amounts of data 
in near real-time. BigQuery currently [doesn’t 
support|https://cloud.google.com/bigquery/external-data-cloud-storage] Apache 
Hudi file format, but it has support for the Parquet file format. The proposal 
is to implement a BigQuerySync similar to HiveSync to sync the Hudi table as 
the BigQuery External Parquet table so that users can query the Hudi tables 
using BigQuery. Uber is already syncing some of its Hudi tables to BigQuery 
data mart this will help them to write, sync, and query.

 

RFC-34: 
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=188745980


> [Umbrella] [RFC-34] Implement BigQuerySyncTool for BigQuery Sync
> 
>
> Key: HUDI-2438
> URL: https://issues.apache.org/jira/browse/HUDI-2438
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Common Core
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Major
>  Labels: BigQuery, Integration
> Fix For: 0.10.0
>
>
> BigQuery is Google Cloud's fully managed, petabyte-scale, and cost-effective 
> analytics data warehouse that lets you run analytics over vast amounts of 
> data in near real-time. BigQuery currently [doesn’t 
> support|https://cloud.google.com/bigquery/external-data-cloud-storage] Apache 
> Hudi file format, but it has support for the Parquet file format. The 
> proposal is to implement a BigQuerySync similar to HiveSync to sync the Hudi 
> table as the BigQuery External Parquet table so that users can query the Hudi 
> tables using BigQuery. Uber is already syncing some of its Hudi tables to 
> BigQuery data mart this will help them to write, sync, and query.
>  
> More details are in RFC-34: 
> [https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=188745980]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2438) [Umbrella] [RFC-34] Implement BigQuerySyncTool for BigQuery Sync

2021-09-15 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan updated HUDI-2438:
--
Description: 
BigQuery is Google Cloud's fully managed, petabyte-scale, and cost-effective 
analytics data warehouse that lets you run analytics over vast amounts of data 
in near real-time. BigQuery currently [doesn’t 
support|https://cloud.google.com/bigquery/external-data-cloud-storage] Apache 
Hudi file format, but it has support for the Parquet file format. The proposal 
is to implement a BigQuerySync similar to HiveSync to sync the Hudi table as 
the BigQuery External Parquet table so that users can query the Hudi tables 
using BigQuery. Uber is already syncing some of its Hudi tables to a BigQuery 
data mart; this will help them to write, sync, and query.

 

RFC-34: 
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=188745980

  was:BigQuery is Google Cloud's fully managed, petabyte-scale, and 
cost-effective analytics data warehouse that lets you run analytics over vast 
amounts of data in near real-time. BigQuery currently [doesn’t 
support|https://cloud.google.com/bigquery/external-data-cloud-storage] Apache 
Hudi file format, but it has support for the Parquet file format. The proposal 
is to implement a BigQuerySync similar to HiveSync to sync the Hudi table as 
the BigQuery External Parquet table so that users can query the Hudi tables 
using BigQuery. Uber is already syncing some of its Hudi tables to BigQuery 
data mart this will help them to write, sync, and query.


> [Umbrella] [RFC-34] Implement BigQuerySyncTool for BigQuery Sync
> 
>
> Key: HUDI-2438
> URL: https://issues.apache.org/jira/browse/HUDI-2438
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Common Core
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Major
>  Labels: BigQuery, Integration
> Fix For: 0.10.0
>
>
> BigQuery is Google Cloud's fully managed, petabyte-scale, and cost-effective 
> analytics data warehouse that lets you run analytics over vast amounts of 
> data in near real-time. BigQuery currently [doesn’t 
> support|https://cloud.google.com/bigquery/external-data-cloud-storage] Apache 
> Hudi file format, but it has support for the Parquet file format. The 
> proposal is to implement a BigQuerySync similar to HiveSync to sync the Hudi 
> table as the BigQuery External Parquet table so that users can query the Hudi 
> tables using BigQuery. Uber is already syncing some of its Hudi tables to 
> BigQuery data mart this will help them to write, sync, and query.
>  
> RFC-34: 
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=188745980



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2438) [Umbrella] [RFC-34] Implement BigQuerySyncTool for BigQuery Sync

2021-09-15 Thread Vinoth Govindarajan (Jira)
Vinoth Govindarajan created HUDI-2438:
-

 Summary: [Umbrella] [RFC-34] Implement BigQuerySyncTool for 
BigQuery Sync
 Key: HUDI-2438
 URL: https://issues.apache.org/jira/browse/HUDI-2438
 Project: Apache Hudi
  Issue Type: New Feature
  Components: Common Core
Reporter: Vinoth Govindarajan
Assignee: Vinoth Govindarajan
 Fix For: 0.10.0


BigQuery is Google Cloud's fully managed, petabyte-scale, and cost-effective 
analytics data warehouse that lets you run analytics over vast amounts of data 
in near real-time. BigQuery currently [doesn’t 
support|https://cloud.google.com/bigquery/external-data-cloud-storage] Apache 
Hudi file format, but it has support for the Parquet file format. The proposal 
is to implement a BigQuerySync similar to HiveSync to sync the Hudi table as 
the BigQuery External Parquet table so that users can query the Hudi tables 
using BigQuery. Uber is already syncing some of its Hudi tables to a BigQuery 
data mart; this will help them to write, sync, and query.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2407) [Website] Support staging site for per pull request

2021-09-08 Thread Vinoth Govindarajan (Jira)
Vinoth Govindarajan created HUDI-2407:
-

 Summary: [Website] Support staging site for per pull request
 Key: HUDI-2407
 URL: https://issues.apache.org/jira/browse/HUDI-2407
 Project: Apache Hudi
  Issue Type: New Feature
  Components: Docs
Reporter: Vinoth Govindarajan
Assignee: Vinoth Govindarajan


Provide a staging site for asf website pull request.

 

Something similar to this Jekyll site PR:

https://github.com/apache/hudi/pull/1698



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2357) MERGE INTO doesn't work for tables created using CTAS

2021-08-24 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan updated HUDI-2357:
--
Description: 
The MERGE INTO command doesn't select the correct primary key for tables created 
using CTAS, whereas it works for tables created using the CREATE TABLE command.

I guess we are hitting this issue because the key generator class is set to 
SqlKeyGenerator for tables created using CTAS:

working use-case:
{code:java}
create table h5 (id bigint, name string, ts bigint) using hudi
options (type = "cow" , primaryKey="id" , preCombineField="ts" );

merge into h5 as t0
using (
select 5 as s_id, 'vinoth' as s_name, current_timestamp() as s_ts
) t1
on t1.s_id = t0.id
when matched then update set * 
when not matched then insert *;
{code}
hoodie.properties for working use-case:
{code:java}
➜  analytics.db git:(apache_hudi_support) cat h5/.hoodie/hoodie.properties
#Properties saved on Wed Aug 25 04:10:33 UTC 2021
#Wed Aug 25 04:10:33 UTC 2021
hoodie.table.name=h5
hoodie.table.recordkey.fields=id
hoodie.table.type=COPY_ON_WRITE
hoodie.table.precombine.field=ts
hoodie.table.partition.fields=
hoodie.archivelog.folder=archived
hoodie.table.create.schema={"type"\:"record","name"\:"topLevelRecord","fields"\:[{"name"\:"_hoodie_commit_time","type"\:["string","null"]},{"name"\:"_hoodie_commit_seqno","type"\:["string","null"]},{"name"\:"_hoodie_record_key","type"\:["string","null"]},{"name"\:"_hoodie_partition_path","type"\:["string","null"]},{"name"\:"_hoodie_file_name","type"\:["string","null"]},{"name"\:"id","type"\:["long","null"]},{"name"\:"name","type"\:["string","null"]},{"name"\:"ts","type"\:["long","null"]}]}
hoodie.timeline.layout.version=1
hoodie.table.version=1{code}
 

Whereas this doesn't work:
{code:java}
create table h4 using hudi options (type = "cow" , primaryKey="id" , 
preCombineField="ts" ) as select 5 as id, cast(rand() as string) as name, 
current_timestamp();

merge into h3 as t0
using (select '5' as s_id, 'vinoth' as s_name, current_timestamp() as s_ts) t1
on t1.s_id = t0.id
when matched then update set *
when not matched then insert *;

ERROR LOG
544702 [main] ERROR org.apache.spark.sql.hive.thriftserver.SparkSQLDriver  - 
Failed in [merge into analytics.h3 as t0using (    select '5' as s_id, 'vinoth' 
as s_name, current_timestamp() as s_ts) t1on t1.s_id = t0.idwhen matched then 
update set *when not matched then insert *]java.lang.IllegalArgumentException: 
Merge Key[id] is not Equal to the defined primary key[] in table h3 at 
org.apache.spark.sql.hudi.command.MergeIntoHoodieTableCommand.buildMergeIntoConfig(MergeIntoHoodieTableCommand.scala:425)
 at 
org.apache.spark.sql.hudi.command.MergeIntoHoodieTableCommand.run(MergeIntoHoodieTableCommand.scala:147)
 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
 at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229) at 
org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618) at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
 at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
 at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
 at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
 at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616) at 
org.apache.spark.sql.Dataset.(Dataset.scala:229) at 
org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100) at 
org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at 
org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97) at 
org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607) at 
org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at 
org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602) at 
org.apache.spark.sql.SQLContext.sql(SQLContext.scala:650) at 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:377)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:496)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1$adapted(SparkSQLCLIDriver.scala:490)
 at scala.collection.Iterator.foreach(Iterator.scala:941) at 
scala.collection.Iterator.foreach$(Iterator.scala:941) at 
scala.collection.AbstractIterator.foreach(Iterator.scala:1429) at 

[jira] [Updated] (HUDI-2357) MERGE INTO doesn't work for tables created using CTAS

2021-08-24 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan updated HUDI-2357:
--
Description: 
MERGE INTO command doesn't select the correct primary key for tables created 
using CTAS, whereas it works for tables created using CREATE TABLE command.

I guess we are hitting this issue because the key generator class is set to 
SqlKeyGenerator for tables created using CTAS:

working use-case:
{code:java}
create table h5 (id bigint, name string, ts bigint) using hudi
options (type = "cow" , primaryKey="id" , preCombineField="ts" );

merge into h5 as t0
using (
select 5 as s_id, 'vinoth' as s_name, current_timestamp() as s_ts
) t1
on t1.s_id = t0.id
when matched then update set * 
when not matched then insert *;
{code}
hoodie.properties for working use-case:
{code:java}
➜  analytics.db git:(apache_hudi_support) cat h5/.hoodie/hoodie.properties
#Properties saved on Wed Aug 25 04:10:33 UTC 2021
#Wed Aug 25 04:10:33 UTC 2021
hoodie.table.name=h5
hoodie.table.recordkey.fields=id
hoodie.table.type=COPY_ON_WRITE
hoodie.table.precombine.field=ts
hoodie.table.partition.fields=
hoodie.archivelog.folder=archived
hoodie.table.create.schema={"type"\:"record","name"\:"topLevelRecord","fields"\:[{"name"\:"_hoodie_commit_time","type"\:["string","null"]},{"name"\:"_hoodie_commit_seqno","type"\:["string","null"]},{"name"\:"_hoodie_record_key","type"\:["string","null"]},{"name"\:"_hoodie_partition_path","type"\:["string","null"]},{"name"\:"_hoodie_file_name","type"\:["string","null"]},{"name"\:"id","type"\:["long","null"]},{"name"\:"name","type"\:["string","null"]},{"name"\:"ts","type"\:["long","null"]}]}
hoodie.timeline.layout.version=1
hoodie.table.version=1{code}
 

Whereas this doesn't work:
{code:java}
create table h4 using hudi
options (type = "cow" , primaryKey="id" , preCombineField="ts" )
as select 5 as id, cast(rand() as string) as name, current_timestamp();

merge into h3 as t0
using (select '5' as s_id, 'vinoth' as s_name, current_timestamp() as s_ts) t1
on t1.s_id = t0.id
when matched then update set *
when not matched then insert *;

ERROR LOG
544702 [main] ERROR org.apache.spark.sql.hive.thriftserver.SparkSQLDriver  - 
Failed in [merge into analytics.h3 as t0using (    select '5' as s_id, 'vinoth' 
as s_name, current_timestamp() as s_ts) t1on t1.s_id = t0.idwhen matched then 
update set *when not matched then insert *]java.lang.IllegalArgumentException: 
Merge Key[id] is not Equal to the defined primary key[] in table h3 at 
org.apache.spark.sql.hudi.command.MergeIntoHoodieTableCommand.buildMergeIntoConfig(MergeIntoHoodieTableCommand.scala:425)
 at 
org.apache.spark.sql.hudi.command.MergeIntoHoodieTableCommand.run(MergeIntoHoodieTableCommand.scala:147)
 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
 at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229) at 
org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618) at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
 at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
 at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
 at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
 at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616) at 
org.apache.spark.sql.Dataset.(Dataset.scala:229) at 
org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100) at 
org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at 
org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97) at 
org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607) at 
org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at 
org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602) at 
org.apache.spark.sql.SQLContext.sql(SQLContext.scala:650) at 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:377)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:496)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1$adapted(SparkSQLCLIDriver.scala:490)
 at scala.collection.Iterator.foreach(Iterator.scala:941) at 
scala.collection.Iterator.foreach$(Iterator.scala:941) at 

[jira] [Created] (HUDI-2357) MERGE INTO doesn't work for tables created using CTAS

2021-08-24 Thread Vinoth Govindarajan (Jira)
Vinoth Govindarajan created HUDI-2357:
-

 Summary: MERGE INTO doesn't work for tables created using CTAS
 Key: HUDI-2357
 URL: https://issues.apache.org/jira/browse/HUDI-2357
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: Spark Integration
Reporter: Vinoth Govindarajan
Assignee: pengzhiwei
 Fix For: 0.9.0


MERGE INTO command doesn't select the correct primaryKey for tables created 
using CTAS, whereas it works for tables created using CREATE TABLE command.

I guess we are hitting this issue because the keygenerator class is set to 
SqlKeyGenerator for tables created using CTAS:

 

This works:

 
{code:java}
create table h5 (id bigint, name string, ts bigint) using hudi
options (type = "cow" , primaryKey="id" , preCombineField="ts" );

merge into h5 as t0
using (
select 5 as s_id, 'vinoth' as s_name, current_timestamp() as s_ts
) t1
on t1.s_id = t0.id
when matched then update set * 
when not matched then insert *;
{code}
hoodie.properties for working use-case:

 

 
{code:java}
➜  analytics.db git:(apache_hudi_support) cat h5/.hoodie/hoodie.properties
#Properties saved on Wed Aug 25 04:10:33 UTC 2021
#Wed Aug 25 04:10:33 UTC 2021
hoodie.table.name=h5
hoodie.table.recordkey.fields=id
hoodie.table.type=COPY_ON_WRITE
hoodie.table.precombine.field=ts
hoodie.table.partition.fields=
hoodie.archivelog.folder=archived
hoodie.table.create.schema={"type"\:"record","name"\:"topLevelRecord","fields"\:[{"name"\:"_hoodie_commit_time","type"\:["string","null"]},{"name"\:"_hoodie_commit_seqno","type"\:["string","null"]},{"name"\:"_hoodie_record_key","type"\:["string","null"]},{"name"\:"_hoodie_partition_path","type"\:["string","null"]},{"name"\:"_hoodie_file_name","type"\:["string","null"]},{"name"\:"id","type"\:["long","null"]},{"name"\:"name","type"\:["string","null"]},{"name"\:"ts","type"\:["long","null"]}]}
hoodie.timeline.layout.version=1
hoodie.table.version=1{code}
 

Whereas this doesn't work:
{code:java}
create table h4
using hudi
options (type = "cow" , primaryKey="id" , preCombineField="ts" )
as select 5 as id, cast(rand() as string) as name, current_timestamp();

merge into h3 as t0
using (
    select '5' as s_id, 'vinoth' as s_name, current_timestamp() as s_ts
) t1
on t1.s_id = t0.id
when matched then update set *
when not matched then insert *;

544702 [main] ERROR org.apache.spark.sql.hive.thriftserver.SparkSQLDriver  - 
Failed in [merge into analytics.h3 as t0 using (select '5' as s_id, 'vinoth' as s_name, current_timestamp() as s_ts) t1 on t1.s_id = t0.id when matched then update set * when not matched then insert *]
java.lang.IllegalArgumentException: Merge Key[id] is not Equal to the defined primary key[] in table h3 at 
org.apache.spark.sql.hudi.command.MergeIntoHoodieTableCommand.buildMergeIntoConfig(MergeIntoHoodieTableCommand.scala:425)
 at 
org.apache.spark.sql.hudi.command.MergeIntoHoodieTableCommand.run(MergeIntoHoodieTableCommand.scala:147)
 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
 at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229) at 
org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618) at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
 at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
 at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
 at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
 at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616) at 
org.apache.spark.sql.Dataset.<init>(Dataset.scala:229) at 
org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100) at 
org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at 
org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97) at 
org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607) at 
org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at 
org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602) at 
org.apache.spark.sql.SQLContext.sql(SQLContext.scala:650) at 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:377)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:496)
 at 

[jira] [Updated] (HUDI-2319) Integrate hudi with dbt (data build tool)

2021-08-18 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan updated HUDI-2319:
--
Summary: Integrate hudi with dbt (data build tool)  (was: Integrate hudi 
with dbt)

> Integrate hudi with dbt (data build tool)
> -
>
> Key: HUDI-2319
> URL: https://issues.apache.org/jira/browse/HUDI-2319
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Usability
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Major
>  Labels: integration
> Fix For: 0.10.0
>
>
> dbt (data build tool) enables analytics engineers to transform data in their 
> warehouses by simply writing select statements. dbt handles turning these 
> select statements into tables and views.
> dbt does the {{T}} in {{ELT}} (Extract, Load, Transform) processes – it 
> doesn’t extract or load data, but it’s extremely good at transforming data 
> that’s already loaded into your warehouse.
>  
> dbt currently supports only the delta file format. There are a few folks in the dbt 
> community asking for hudi integration; moreover, adding dbt integration will 
> make it easier to create derived datasets for any data engineer/analyst.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2319) Integrate hudi with dbt

2021-08-18 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan updated HUDI-2319:
--
Status: In Progress  (was: Open)

> Integrate hudi with dbt
> ---
>
> Key: HUDI-2319
> URL: https://issues.apache.org/jira/browse/HUDI-2319
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Usability
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Major
>  Labels: integration
> Fix For: 0.10.0
>
>
> dbt (data build tool) enables analytics engineers to transform data in their 
> warehouses by simply writing select statements. dbt handles turning these 
> select statements into tables and views.
> dbt does the {{T}} in {{ELT}} (Extract, Load, Transform) processes – it 
> doesn’t extract or load data, but it’s extremely good at transforming data 
> that’s already loaded into your warehouse.
>  
> dbt currently supports only the delta file format. There are a few folks in the dbt 
> community asking for hudi integration; moreover, adding dbt integration will 
> make it easier to create derived datasets for any data engineer/analyst.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2319) Integrate hudi with dbt

2021-08-18 Thread Vinoth Govindarajan (Jira)
Vinoth Govindarajan created HUDI-2319:
-

 Summary: Integrate hudi with dbt
 Key: HUDI-2319
 URL: https://issues.apache.org/jira/browse/HUDI-2319
 Project: Apache Hudi
  Issue Type: New Feature
  Components: Usability
Reporter: Vinoth Govindarajan
Assignee: Vinoth Govindarajan
 Fix For: 0.10.0


dbt (data build tool) enables analytics engineers to transform data in their 
warehouses by simply writing select statements. dbt handles turning these 
select statements into tables and views.

dbt does the {{T}} in {{ELT}} (Extract, Load, Transform) processes – it doesn’t 
extract or load data, but it’s extremely good at transforming data that’s 
already loaded into your warehouse.

 

dbt currently supports only the delta file format. There are a few folks in the dbt 
community asking for hudi integration; moreover, adding dbt integration will 
make it easier to create derived datasets for any data engineer/analyst.

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-2275) HoodieDeltaStreamerException when using OCC and a second concurrent writer

2021-08-11 Thread Vinoth Govindarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397526#comment-17397526
 ] 

Vinoth Govindarajan commented on HUDI-2275:
---

I can confirm that this is the correct setting to use for backfill pipelines, and it 
works for our use case. We have set this option for our concurrent backfill pipelines 
so that they don't change the delta streamer checkpoint key and instead copy over the 
last checkpoint key for the backfill commit, hence not interrupting the regular 
incremental query.
{code:java}
option("hoodie.write.meta.key.prefixes", "deltastreamer.checkpoint.key")
{code}
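
For illustration, here is a minimal sketch of how a concurrent backfill write with this 
option might look from the Spark datasource side. The table path, input path, and column 
names are hypothetical; the Hudi option keys are the ones already shown in this thread 
and in the issue description below.
{code:scala}
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("hudi-backfill-sketch").getOrCreate()

// Hypothetical backfill input: a few older partitions that need to be rewritten.
val backfillDf = spark.read.parquet("s3://bucket/staging/backfill_2021_03/")

backfillDf.write
  .format("hudi")
  .option("hoodie.table.name", "delta_streamer_test")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.partitionpath.field", "datestr")
  // Carry the delta streamer checkpoint key over to the backfill commit so the
  // regular incremental pipeline is not interrupted (same option as above).
  .option("hoodie.write.meta.key.prefixes", "deltastreamer.checkpoint.key")
  // OCC settings so this writer can run concurrently with the delta streamer
  // (plus the lock provider settings from the issue description below).
  .option("hoodie.write.concurrency.mode", "optimistic_concurrency_control")
  .option("hoodie.cleaner.policy.failed.writes", "LAZY")
  .mode(SaveMode.Append)
  .save("s3://bucket/warehouse/delta_streamer_test")
{code}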
 

> HoodieDeltaStreamerException when using OCC and a second concurrent writer
> --
>
> Key: HUDI-2275
> URL: https://issues.apache.org/jira/browse/HUDI-2275
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer, Spark Integration, Writer Core
>Affects Versions: 0.9.0
>Reporter: Dave Hagman
>Assignee: Sagar Sumit
>Priority: Critical
> Fix For: 0.10.0
>
>
>  I am trying to utilize [Optimistic Concurrency 
> Control|https://hudi.apache.org/docs/concurrency_control] in order to allow 
> two writers to update a single table simultaneously. The two writers are:
>  * Writer A: Deltastreamer job consuming continuously from Kafka
>  * Writer B: A spark datasource-based writer that is consuming parquet files 
> out of S3
>  * Table Type: Copy on Write
>  
> After a few commits from each writer the deltastreamer will fail with the 
> following exception:
>  
> {code:java}
> org.apache.hudi.exception.HoodieDeltaStreamerException: Unable to find 
> previous checkpoint. Please double check if this table was indeed built via 
> delta streamer. Last Commit :Option{val=[20210803165741__commit__COMPLETED]}, 
> Instants :[[20210803165741__commit__COMPLETED]], CommitMetadata={
>  "partitionToWriteStats" : {
>  ...{code}
>  
> What appears to be happening is a lack of commit isolation between the two 
> writers
>  Writer B (spark datasource writer) will land commits which are eventually 
> picked up by Writer A (Delta Streamer). This is an issue because the Delta 
> Streamer needs checkpoint information which the spark datasource of course 
> does not include in its commits. My understanding was that OCC was built for 
> this very purpose (among others). 
> OCC config for Delta Streamer:
> {code:java}
> hoodie.write.concurrency.mode=optimistic_concurrency_control
>  hoodie.cleaner.policy.failed.writes=LAZY
>  
> hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider
>  hoodie.write.lock.zookeeper.url=
>  hoodie.write.lock.zookeeper.port=2181
>  hoodie.write.lock.zookeeper.lock_key=writer_lock
>  hoodie.write.lock.zookeeper.base_path=/hudi-write-locks{code}
>  
> OCC config for spark datasource:
> {code:java}
> // Multi-writer concurrency
>  .option("hoodie.cleaner.policy.failed.writes", "LAZY")
>  .option("hoodie.write.concurrency.mode", "optimistic_concurrency_control")
>  .option(
>  "hoodie.write.lock.provider",
>  
> org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider.class.getCanonicalName()
>  )
>  .option("hoodie.write.lock.zookeeper.url", jobArgs.zookeeperHost)
>  .option("hoodie.write.lock.zookeeper.port", jobArgs.zookeeperPort)
>  .option("hoodie.write.lock.zookeeper.lock_key", "writer_lock")
>  .option("hoodie.write.lock.zookeeper.base_path", "/hudi-write-locks"){code}
> h3. Steps to Reproduce:
>  * Start a deltastreamer job against some table Foo
>  * In parallel, start writing to the same table Foo using spark datasource 
> writer
>  * Note that after a few commits from each the deltastreamer is likely to 
> fail with the above exception when the datasource writer creates non-isolated 
> inflight commits
> NOTE: I have not tested this with two of the same datasources (ex. two 
> deltastreamer jobs)
> NOTE 2: Another detail that may be relevant is that the two writers are on 
> completely different spark clusters but I assumed this shouldn't be an issue 
> since we're locking using Zookeeper



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1985) Website re-design implementation

2021-07-29 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan updated HUDI-1985:
--
Status: In Progress  (was: Open)

> Website re-design implementation
> 
>
> Key: HUDI-1985
> URL: https://issues.apache.org/jira/browse/HUDI-1985
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Docs
>Reporter: Raymond Xu
>Assignee: Vinoth Govindarajan
>Priority: Blocker
>  Labels: documentation, pull-request-available
> Fix For: 0.9.0
>
>
> To provide better navigation and organization of Hudi website's info, we have 
> done a re-design of the web pages.
> Previous discussion
> [https://github.com/apache/hudi/issues/2905]
>  
> See the wireframe and final design in 
> [https://www.figma.com/file/tipod1JZRw7anZRWBI6sZh/Hudi.Apache?node-id=32%3A6]
> (login Figma to comment)
> The design is ready for implementation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1985) Website re-design implementation

2021-07-14 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan reassigned HUDI-1985:
-

Assignee: Vinoth Govindarajan

> Website re-design implementation
> 
>
> Key: HUDI-1985
> URL: https://issues.apache.org/jira/browse/HUDI-1985
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Docs
>Reporter: Raymond Xu
>Assignee: Vinoth Govindarajan
>Priority: Blocker
>  Labels: documentation
> Fix For: 0.9.0
>
>
> To provide better navigation and organization of Hudi website's info, we have 
> done a re-design of the web pages.
> Previous discussion
> [https://github.com/apache/hudi/issues/2905]
>  
> See the wireframe and final design in 
> [https://www.figma.com/file/tipod1JZRw7anZRWBI6sZh/Hudi.Apache?node-id=32%3A6]
> (login Figma to comment)
> The design is ready for implementation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1985) Website re-design implementation

2021-07-12 Thread Vinoth Govindarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17379583#comment-17379583
 ] 

Vinoth Govindarajan commented on HUDI-1985:
---

Hi [~xushiyan],
I have experience building websites, so I can volunteer to work on this re-design.

 

> Website re-design implementation
> 
>
> Key: HUDI-1985
> URL: https://issues.apache.org/jira/browse/HUDI-1985
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Docs
>Reporter: Raymond Xu
>Priority: Blocker
>  Labels: documentation
> Fix For: 0.9.0
>
>
> To provide better navigation and organization of Hudi website's info, we have 
> done a re-design of the web pages.
> Previous discussion
> [https://github.com/apache/hudi/issues/2905]
>  
> See the wireframe and final design in 
> [https://www.figma.com/file/tipod1JZRw7anZRWBI6sZh/Hudi.Apache?node-id=32%3A6]
> (login Figma to comment)
> The design is ready for implementation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1117) Add tdunning json library to spark and utilities bundle

2021-07-09 Thread Vinoth Govindarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17378256#comment-17378256
 ] 

Vinoth Govindarajan commented on HUDI-1117:
---

 Even after adding the JSON jar to the classpath, the issue was not resolved.

 

> Add tdunning json library to spark and utilities bundle
> ---
>
> Key: HUDI-1117
> URL: https://issues.apache.org/jira/browse/HUDI-1117
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Spark Integration
>Affects Versions: 0.9.0
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
>  Labels: sev:high, user-support-issues
> Fix For: 0.9.0
>
>
> Exception during Hive Sync:
> ```
> An error occurred while calling o175.save.\n: java.lang.NoClassDefFoundError: 
> org/json/JSONException\n\tat 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeCreateTable(SemanticAnalyzer.java:10847)\n\tat
>  
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genResolvedParseTree(SemanticAnalyzer.java:10047)\n\tat
>  
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10128)\n\tat
>  
> org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:209)\n\tat
>  
> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:227)\n\tat
>  org.apache.hadoop.hive.ql.Driver.compile(Driver.java:424)\n\tat 
> org.apache.hadoop.hive.ql.Driver.compile(Driver.java:308)\n\tat 
> org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1122)\n\tat 
> org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1170)\n\tat 
> org.apache.hadoop.hive.ql.Driver.run(Driver.java:1059)\n\tat 
> org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)\n\tat 
> org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLs(HoodieHiveClient.java:515)\n\tat
>  
> org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLUsingHiveDriver(HoodieHiveClient.java:498)\n\tat
>  
> org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:488)\n\tat
>  
> org.apache.hudi.hive.HoodieHiveClient.createTable(HoodieHiveClient.java:273)\n\tat
>  org.apache.hudi.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:146)\n\tat
> ```
> This is from using hudi-spark-bundle. 
> [https://github.com/apache/hudi/issues/1787]
> The JSONException class comes from 
> https://mvnrepository.com/artifact/org.json/json. There is a licensing issue, and 
> hence it is not part of the hudi bundle packages. The underlying issue is due to Hive 
> 1.x vs 2.x ( See 
> https://issues.apache.org/jira/browse/HUDI-150?jql=text%20~%20%22org.json%22%20and%20project%20%3D%20%22Apache%20Hudi%22%20)
> Spark Hive integration still brings in hive 1.x jars, which depend on org.json. I 
> believe this was provided in the user's environment, and hence we have not seen folks 
> complaining about this issue.
> Even though this is not a Hudi issue per se, let me check a jar with a compatible 
> license: https://mvnrepository.com/artifact/com.tdunning/json/1.8 and, if it 
> works, we will add it to the 0.6 bundles after discussing with the community. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1743) Add support for Spark SQL File based transformer for deltastreamer

2021-06-09 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan updated HUDI-1743:
--
Status: Closed  (was: Patch Available)

> Add support for Spark SQL File based transformer for deltastreamer
> --
>
> Key: HUDI-1743
> URL: https://issues.apache.org/jira/browse/HUDI-1743
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Minor
>  Labels: features, pull-request-available, sev:normal
>
> The current SQLQuery-based transformer is limited in functionality: you can't 
> pass multiple Spark SQL statements separated by semicolons, which is 
> necessary if your transformation is complex.
>  
> The ask is to add a new SQLFileBasedTransformer which takes a Spark SQL file 
> with multiple Spark SQL statements as input and applies the transformation to 
> the delta streamer payload.
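
For illustration only, a rough sketch of the idea behind such a transformer, not the 
exact class that was merged for this ticket: register the incoming batch as a temp 
view, run every statement from the SQL file in order, and hand back the result of the 
last one. The temp view name and the local file read are assumptions made for this 
sketch.
{code:scala}
import org.apache.spark.sql.{DataFrame, SparkSession}
import scala.io.Source

// Sketch: apply a file of ';'-separated Spark SQL statements to the incoming batch.
def applySqlFile(spark: SparkSession, rowDataset: DataFrame, sqlFilePath: String): DataFrame = {
  // Expose the incoming delta streamer batch to SQL under an assumed temp view name.
  rowDataset.createOrReplaceTempView("HOODIE_SRC_TMP_TABLE")

  // Read the SQL file (local path here for simplicity; a real implementation would
  // likely read from HDFS/S3) and keep each non-empty statement in order.
  val statements = Source.fromFile(sqlFilePath).mkString
    .split(";")
    .map(_.trim)
    .filter(_.nonEmpty)

  // Run every statement; the result of the last one becomes the transformed payload.
  statements.map(spark.sql).last
}
{code}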



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1790) Add SqlSource for DeltaStreamer to support backfill use cases

2021-04-28 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan updated HUDI-1790:
--
Status: Patch Available  (was: In Progress)

> Add SqlSource for DeltaStreamer to support backfill use cases
> -
>
> Key: HUDI-1790
> URL: https://issues.apache.org/jira/browse/HUDI-1790
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: DeltaStreamer
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Major
>  Labels: pull-request-available
>
> Delta Streamer is great for incremental workloads, but we also need to support 
> backfills, for use cases like adding a new column and backfilling only that 
> column for the last 6 months, or reprocessing a couple of older partitions 
> after a bug in our transformation logic.
>  
> If we have a SqlSource as one of the input sources to the delta streamer, then 
> I can pass any custom Spark SQL query selecting specific partitions and 
> backfill them.
>  
> When we do the backfill, we don't need to update the last processed commit 
> checkpoint; instead, the last processed checkpoint from before the backfill 
> should be copied over to the backfill commit.
>  
> cc [~nishith29]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1747) Deltastreamer incremental read is not working on the MOR table

2021-04-19 Thread Vinoth Govindarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17325466#comment-17325466
 ] 

Vinoth Govindarajan commented on HUDI-1747:
---

[~shivnarayan] - Here are the answers:
 * Do you see the exception during first time itself when trying to 
incrementally read from hudi MOR table via delta streamer? or after few 
incremental pulls via deltastreamer. 
 ** I see it the first time itself.
 * I assume the schema given in the stack trace is the source table schema of 
deltastreamer. Can you confirm that. 
 ** It's a big table; the schema in the stack trace is just the first few fields, 
since not all the fields are shown in the log.
 * I need to set aside sometime to try and reproduce this. but if you have 
steps to reproduce w/ configs, would be awesome. but if it might consume too 
much time for you, nvm. 
 ** I have a test pipeline which you can use for debugging; I will share the 
details in Slack.

> Deltastreamer incremental read is not working on the MOR table
> --
>
> Key: HUDI-1747
> URL: https://issues.apache.org/jira/browse/HUDI-1747
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Common Core
>Reporter: Vinoth Govindarajan
>Priority: Critical
>  Labels: sev:critical
>
> I was trying to read a MOR Hudi table incrementally using delta streamer; while 
> doing that, I ran into this issue, which says:
> {code:java}
> Found recursive reference in Avro schema, which can not be processed by 
> Spark:{code}
> Spark Version: 2.4
> Hudi Version: 0.7.0-SNAPSHOT or the latest master
>  
> Full Stack Trace:
> {code:java}
> Found recursive reference in Avro schema, which can not be processed by Spark:
> {
>   "type" : "record",
>   "name" : "meta",
>   "fields" : [ {
> "name" : "verified",
> "type" : [ "null", "boolean" ],
> "default" : null
>   }, {
> "name" : "zip",
> "type" : [ "null", "string" ],
> "default" : null
>   }, {
> "name" : "lname",
> "type" : [ "null", "string" ],
> "default" : null
>   }]
> }
>   
>   at 
> org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:75)
>   at 
> org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:105)
>   at 
> org.apache.spark.sql.avro.SchemaConverters$$anonfun$1.apply(SchemaConverters.scala:82)
>   at 
> org.apache.spark.sql.avro.SchemaConverters$$anonfun$1.apply(SchemaConverters.scala:81)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:891)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:81)
>   at 
> org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:105)
>   at 
> org.apache.spark.sql.avro.SchemaConverters$$anonfun$1.apply(SchemaConverters.scala:82)
>   at 
> org.apache.spark.sql.avro.SchemaConverters$$anonfun$1.apply(SchemaConverters.scala:81)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:891)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:81)
>   at 
> org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:95)
>   at 
> org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:105)
>   at 
> org.apache.spark.sql.avro.SchemaConverters$$anonfun$1.apply(SchemaConverters.scala:82)
>   at 
> org.apache.spark.sql.avro.SchemaConverters$$anonfun$1.apply(SchemaConverters.scala:81)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> 

[jira] [Updated] (HUDI-1790) Add SqlSource for DeltaStreamer to support backfill use cases

2021-04-12 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan updated HUDI-1790:
--
Status: In Progress  (was: Open)

> Add SqlSource for DeltaStreamer to support backfill use cases
> -
>
> Key: HUDI-1790
> URL: https://issues.apache.org/jira/browse/HUDI-1790
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: DeltaStreamer
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Major
>
> Delta Streamer is great for incremental workloads, but we also need to support 
> backfills, for use cases like adding a new column and backfilling only that 
> column for the last 6 months, or reprocessing a couple of older partitions 
> after a bug in our transformation logic.
>  
> If we have a SqlSource as one of the input sources to the delta streamer, then 
> I can pass any custom Spark SQL query selecting specific partitions and 
> backfill them.
>  
> When we do the backfill, we don't need to update the last processed commit 
> checkpoint; instead, the last processed checkpoint from before the backfill 
> should be copied over to the backfill commit.
>  
> cc [~nishith29]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1790) Add SqlSource for DeltaStreamer to support backfill use cases

2021-04-12 Thread Vinoth Govindarajan (Jira)
Vinoth Govindarajan created HUDI-1790:
-

 Summary: Add SqlSource for DeltaStreamer to support backfill use 
cases
 Key: HUDI-1790
 URL: https://issues.apache.org/jira/browse/HUDI-1790
 Project: Apache Hudi
  Issue Type: New Feature
  Components: DeltaStreamer
Reporter: Vinoth Govindarajan
Assignee: Vinoth Govindarajan


Delta Streamer is great for incremental workloads, but we also need to support 
backfills, for use cases like adding a new column and backfilling only that column 
for the last 6 months, or reprocessing a couple of older partitions after a bug in 
our transformation logic.

 

If we have a SqlSource as one of the input sources to the delta streamer, then I 
can pass any custom Spark SQL query selecting specific partitions and backfill them.

 

When we do the backfill, we don't need to update the last processed commit 
checkpoint; instead, the last processed checkpoint from before the backfill 
should be copied over to the backfill commit.

 

cc [~nishith29]
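
As a rough illustration of the idea (not the final implementation), such a source could 
simply run a user-supplied Spark SQL query to produce the backfill batch. The property 
key and query below are made up for this sketch:
{code:scala}
import java.util.Properties
import org.apache.spark.sql.{DataFrame, SparkSession}

// Sketch: fetch a backfill batch by running a custom Spark SQL query from config.
def fetchBackfillBatch(spark: SparkSession, props: Properties): DataFrame = {
  // Hypothetical property carrying the backfill query, e.g.
  // "select * from source_table where datestr between '2021-01-01' and '2021-06-30'".
  val backfillQuery = props.getProperty("hoodie.deltastreamer.source.sql.query")
  spark.sql(backfillQuery)
}
// For a backfill run, the checkpoint is intentionally not advanced: the resulting
// commit should carry over the last checkpoint recorded before the backfill,
// as described above.
{code}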



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1743) Add support for Spark SQL File based transformer for deltastreamer

2021-04-08 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan updated HUDI-1743:
--
Status: Patch Available  (was: In Progress)

> Add support for Spark SQL File based transformer for deltastreamer
> --
>
> Key: HUDI-1743
> URL: https://issues.apache.org/jira/browse/HUDI-1743
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Minor
>  Labels: features, pull-request-available
>
> The current SQLQuery-based transformer is limited in functionality: you can't 
> pass multiple Spark SQL statements separated by semicolons, which is 
> necessary if your transformation is complex.
>  
> The ask is to add a new SQLFileBasedTransformer which takes a Spark SQL file 
> with multiple Spark SQL statements as input and applies the transformation to 
> the delta streamer payload.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1743) Add support for Spark SQL File based transformer for deltastreamer

2021-04-08 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan updated HUDI-1743:
--
Status: In Progress  (was: Open)

> Add support for Spark SQL File based transformer for deltastreamer
> --
>
> Key: HUDI-1743
> URL: https://issues.apache.org/jira/browse/HUDI-1743
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Minor
>  Labels: features, pull-request-available
>
> The current SQLQuery-based transformer is limited in functionality: you can't 
> pass multiple Spark SQL statements separated by semicolons, which is 
> necessary if your transformation is complex.
>  
> The ask is to add a new SQLFileBasedTransformer which takes a Spark SQL file 
> with multiple Spark SQL statements as input and applies the transformation to 
> the delta streamer payload.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1762) Hive Sync is not working with Hive Style Partitioning

2021-04-08 Thread Vinoth Govindarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317655#comment-17317655
 ] 

Vinoth Govindarajan commented on HUDI-1762:
---

PR merged.

> Hive Sync is not working with Hive Style Partitioning
> -
>
> Key: HUDI-1762
> URL: https://issues.apache.org/jira/browse/HUDI-1762
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Major
>  Labels: hive, pull-request-available
>
> When you create a Hudi table with hive style partitioning and enable hive sync, 
> it doesn't work, because hive sync assumes the partition values are separated 
> by slashes.
>  
> When hive style partitioning is enabled for the target table like this:
> {code:java}
> hoodie.datasource.write.partitionpath.field=datestr
> hoodie.datasource.write.hive_style_partitioning=true
> {code}
> This is the error it throws:
> {code:java}
> 21/04/01 23:10:33 ERROR deltastreamer.HoodieDeltaStreamer: Got error running 
> delta sync once. Shutting down
> org.apache.hudi.exception.HoodieException: Got runtime exception when hive 
> syncing delta_streamer_test
>   at 
> org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:122)
>   at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncMeta(DeltaSync.java:560)
>   at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:475)
>   at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:282)
>   at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:170)
>   at org.apache.hudi.common.util.Option.ifPresent(Option.java:96)
>   at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:168)
>   at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:470)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:690)
> Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Failed to sync 
> partitions for table fact_scheduled_trip__1pc_trip_uuid
>   at 
> org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:229)
>   at 
> org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:166)
>   at 
> org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:108)
>   ... 12 more
> Caused by: java.lang.IllegalArgumentException: Partition path 
> datestr=2021-03-28 is not in the form /mm/dd 
>   at 
> org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor.extractPartitionValuesInPath(SlashEncodedDayPartitionValueExtractor.java:55)
>   at 
> org.apache.hudi.hive.HoodieHiveClient.getPartitionEvents(HoodieHiveClient.java:220)
>   at 
> org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:221)
>   ... 14 more
> {code}
> To fix this issue we need to create a new partition extractor class and 
> assign that class name as the hive sync partition extractor.
> After you define the new partition extractor class, you can configure it like 
> this:
> {code:java}
> hoodie.datasource.hive_sync.enable=true
> hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor
> {code}
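
As an illustration of the fix described above, a partition value extractor for hive 
style paths could simply split on '/' and take the value after each '='. This is a 
sketch of the idea, not necessarily the exact class that was merged; the class name 
and error message are made up here.
{code:scala}
// Sketch: extract partition values from hive style partition paths such as
// "datestr=2021-03-28" or "datestr=2021-03-28/hour=01".
class HiveStylePartitionValueExtractorSketch extends Serializable {
  def extractPartitionValuesInPath(partitionPath: String): java.util.List[String] = {
    val values = partitionPath.split("/").map { segment =>
      segment.split("=") match {
        case Array(_, value) => value
        case _ =>
          throw new IllegalArgumentException(
            s"Partition path $partitionPath is not in the form key=value")
      }
    }
    java.util.Arrays.asList(values: _*)
  }
}
{code}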



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1762) Hive Sync is not working with Hive Style Partitioning

2021-04-08 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan updated HUDI-1762:
--
Status: Closed  (was: Patch Available)

> Hive Sync is not working with Hive Style Partitioning
> -
>
> Key: HUDI-1762
> URL: https://issues.apache.org/jira/browse/HUDI-1762
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Major
>  Labels: hive, pull-request-available
>
> When you create a Hudi table with hive style partitioning and enable hive sync, 
> it doesn't work, because hive sync assumes the partition values are separated 
> by slashes.
>  
> When hive style partitioning is enabled for the target table like this:
> {code:java}
> hoodie.datasource.write.partitionpath.field=datestr
> hoodie.datasource.write.hive_style_partitioning=true
> {code}
> This is the error it throws:
> {code:java}
> 21/04/01 23:10:33 ERROR deltastreamer.HoodieDeltaStreamer: Got error running 
> delta sync once. Shutting down
> org.apache.hudi.exception.HoodieException: Got runtime exception when hive 
> syncing delta_streamer_test
>   at 
> org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:122)
>   at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncMeta(DeltaSync.java:560)
>   at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:475)
>   at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:282)
>   at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:170)
>   at org.apache.hudi.common.util.Option.ifPresent(Option.java:96)
>   at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:168)
>   at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:470)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:690)
> Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Failed to sync 
> partitions for table fact_scheduled_trip__1pc_trip_uuid
>   at 
> org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:229)
>   at 
> org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:166)
>   at 
> org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:108)
>   ... 12 more
> Caused by: java.lang.IllegalArgumentException: Partition path 
> datestr=2021-03-28 is not in the form /mm/dd 
>   at 
> org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor.extractPartitionValuesInPath(SlashEncodedDayPartitionValueExtractor.java:55)
>   at 
> org.apache.hudi.hive.HoodieHiveClient.getPartitionEvents(HoodieHiveClient.java:220)
>   at 
> org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:221)
>   ... 14 more
> {code}
> To fix this issue we need to create a new partition extractor class and 
> assign that class name as the hive sync partition extractor.
> After you define the new partition extractor class, you can configure it like 
> this:
> {code:java}
> hoodie.datasource.hive_sync.enable=true
> hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1770) Deltastreamer throws errors when not running frequently

2021-04-06 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan updated HUDI-1770:
--
Affects Version/s: 0.7.0

> Deltastreamer throws errors when not running frequently
> ---
>
> Key: HUDI-1770
> URL: https://issues.apache.org/jira/browse/HUDI-1770
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Affects Versions: 0.7.0, 0.8.0
>Reporter: Vinoth Govindarajan
>Priority: Major
>
> When delta streamer uses HoodieIncrSource from another parent Hudi table and the 
> delta streamer pipeline is not run frequently, it runs into this error.
>  
> {code:java}
> User class threw exception: org.apache.spark.sql.AnalysisException: Path does 
> not exist: 
> hdfs:///tmp/delta_streamer_test/datestr=2021-03-30/f64f3420-4e03-4835-ab06-5d73cb953aa9-0_3-4-91_20210402163524.parquet;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:558)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:545)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.immutable.List.flatMap(List.scala:355)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:545)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:359)
>   at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
>   at 
> org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:641)
>   at 
> org.apache.hudi.IncrementalRelation.buildScan(IncrementalRelation.scala:151)
>   at 
> org.apache.spark.sql.execution.datasources.DataSourceStrategy.apply(DataSourceStrategy.scala:306)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:78)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:75)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:891)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
>   at 
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
>   at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1334)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:75)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:67)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:78)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:75)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:891)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
>   at 
> 

[jira] [Updated] (HUDI-1770) Deltastreamer throws errors when not running frequently

2021-04-06 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan updated HUDI-1770:
--
Affects Version/s: 0.8.0

> Deltastreamer throws errors when not running frequently
> ---
>
> Key: HUDI-1770
> URL: https://issues.apache.org/jira/browse/HUDI-1770
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Affects Versions: 0.8.0
>Reporter: Vinoth Govindarajan
>Priority: Major
>
> When delta streamer uses HoodieIncrSource from another parent Hudi table and the 
> delta streamer pipeline is not run frequently, it runs into this error.
>  
> {code:java}
> User class threw exception: org.apache.spark.sql.AnalysisException: Path does 
> not exist: 
> hdfs:///tmp/delta_streamer_test/datestr=2021-03-30/f64f3420-4e03-4835-ab06-5d73cb953aa9-0_3-4-91_20210402163524.parquet;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:558)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:545)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.immutable.List.flatMap(List.scala:355)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:545)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:359)
>   at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
>   at 
> org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:641)
>   at 
> org.apache.hudi.IncrementalRelation.buildScan(IncrementalRelation.scala:151)
>   at 
> org.apache.spark.sql.execution.datasources.DataSourceStrategy.apply(DataSourceStrategy.scala:306)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:78)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:75)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:891)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
>   at 
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
>   at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1334)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:75)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:67)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:78)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:75)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:891)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
>   at 
> 

[jira] [Updated] (HUDI-1770) Deltastreamer throws errors when not running frequently

2021-04-06 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan updated HUDI-1770:
--
Description: 
When delta streamer uses HoodieIncrSource from another parent Hudi table and the 
delta streamer pipeline is not run frequently, it runs into this error.

 
{code:java}
User class threw exception: org.apache.spark.sql.AnalysisException: Path does 
not exist: 
hdfs:///tmp/delta_streamer_test/datestr=2021-03-30/f64f3420-4e03-4835-ab06-5d73cb953aa9-0_3-4-91_20210402163524.parquet;
at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:558)
at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:545)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:392)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:355)
at 
org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:545)
at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:359)
at 
org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at 
org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:641)
at 
org.apache.hudi.IncrementalRelation.buildScan(IncrementalRelation.scala:151)
at 
org.apache.spark.sql.execution.datasources.DataSourceStrategy.apply(DataSourceStrategy.scala:306)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:78)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:75)
at 
scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at 
scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at 
scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1334)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:75)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:67)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:78)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:75)
at 
scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at 
scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at 
scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1334)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:75)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:67)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
at 

[jira] [Created] (HUDI-1770) Deltastreamer throws errors when not running frequently

2021-04-06 Thread Vinoth Govindarajan (Jira)
Vinoth Govindarajan created HUDI-1770:
-

 Summary: Deltastreamer throws errors when not running frequently
 Key: HUDI-1770
 URL: https://issues.apache.org/jira/browse/HUDI-1770
 Project: Apache Hudi
  Issue Type: Bug
  Components: DeltaStreamer
Reporter: Vinoth Govindarajan


When delta streamer is using HoodieIncrSource from another parent Hudi table, 
it runs into this error:

 
{code:java}
User class threw exception: org.apache.spark.sql.AnalysisException: Path does 
not exist: 
hdfs:///tmp/delta_streamer_test/datestr=2021-03-30/f64f3420-4e03-4835-ab06-5d73cb953aa9-0_3-4-91_20210402163524.parquet;
at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:558)
at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:545)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:392)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:355)
at 
org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:545)
at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:359)
at 
org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at 
org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:641)
at 
org.apache.hudi.IncrementalRelation.buildScan(IncrementalRelation.scala:151)
at 
org.apache.spark.sql.execution.datasources.DataSourceStrategy.apply(DataSourceStrategy.scala:306)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:78)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:75)
at 
scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at 
scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at 
scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1334)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:75)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:67)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:78)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:75)
at 
scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at 
scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at 
scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1334)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:75)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:67)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
   

[jira] [Updated] (HUDI-1762) Hive Sync is not working with Hive Style Partitioning

2021-04-04 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan updated HUDI-1762:
--
Status: Patch Available  (was: In Progress)

> Hive Sync is not working with Hive Style Partitioning
> -
>
> Key: HUDI-1762
> URL: https://issues.apache.org/jira/browse/HUDI-1762
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Major
>  Labels: hive, pull-request-available
>
> When you create a Hudi table with hive style partitioning and enable hive 
> sync, the sync does not work because it assumes the partition values are 
> separated by slashes.
>  
> When hive style partitioning is enabled for the target table like this:
> {code:java}
> hoodie.datasource.write.partitionpath.field=datestr
> hoodie.datasource.write.hive_style_partitioning=true
> {code}
> This is the error it throws:
> {code:java}
> 21/04/01 23:10:33 ERROR deltastreamer.HoodieDeltaStreamer: Got error running 
> delta sync once. Shutting down
> org.apache.hudi.exception.HoodieException: Got runtime exception when hive 
> syncing delta_streamer_test
>   at 
> org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:122)
>   at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncMeta(DeltaSync.java:560)
>   at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:475)
>   at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:282)
>   at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:170)
>   at org.apache.hudi.common.util.Option.ifPresent(Option.java:96)
>   at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:168)
>   at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:470)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:690)
> Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Failed to sync 
> partitions for table fact_scheduled_trip__1pc_trip_uuid
>   at 
> org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:229)
>   at 
> org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:166)
>   at 
> org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:108)
>   ... 12 more
> Caused by: java.lang.IllegalArgumentException: Partition path 
> datestr=2021-03-28 is not in the form yyyy/mm/dd 
>   at 
> org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor.extractPartitionValuesInPath(SlashEncodedDayPartitionValueExtractor.java:55)
>   at 
> org.apache.hudi.hive.HoodieHiveClient.getPartitionEvents(HoodieHiveClient.java:220)
>   at 
> org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:221)
>   ... 14 more
> {code}
> To fix this issue we need to create a new partition extractor class and 
> assign that class name as the hive sync partition extractor.
> After you define the new partition extractor class, you can configure it like 
> this:
> {code:java}
> hoodie.datasource.hive_sync.enable=true
> hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1762) Hive Sync is not working with Hive Style Partitioning

2021-04-04 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan updated HUDI-1762:
--
Status: In Progress  (was: Open)

> Hive Sync is not working with Hive Style Partitioning
> -
>
> Key: HUDI-1762
> URL: https://issues.apache.org/jira/browse/HUDI-1762
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Major
>  Labels: hive, pull-request-available
>
> When you create a Hudi table with hive style partitioning and enable hive 
> sync, the sync does not work because it assumes the partition values are 
> separated by slashes.
>  
> When hive style partitioning is enabled for the target table like this:
> {code:java}
> hoodie.datasource.write.partitionpath.field=datestr
> hoodie.datasource.write.hive_style_partitioning=true
> {code}
> This is the error it throws:
> {code:java}
> 21/04/01 23:10:33 ERROR deltastreamer.HoodieDeltaStreamer: Got error running 
> delta sync once. Shutting down
> org.apache.hudi.exception.HoodieException: Got runtime exception when hive 
> syncing delta_streamer_test
>   at 
> org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:122)
>   at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncMeta(DeltaSync.java:560)
>   at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:475)
>   at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:282)
>   at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:170)
>   at org.apache.hudi.common.util.Option.ifPresent(Option.java:96)
>   at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:168)
>   at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:470)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:690)
> Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Failed to sync 
> partitions for table fact_scheduled_trip__1pc_trip_uuid
>   at 
> org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:229)
>   at 
> org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:166)
>   at 
> org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:108)
>   ... 12 more
> Caused by: java.lang.IllegalArgumentException: Partition path 
> datestr=2021-03-28 is not in the form yyyy/mm/dd 
>   at 
> org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor.extractPartitionValuesInPath(SlashEncodedDayPartitionValueExtractor.java:55)
>   at 
> org.apache.hudi.hive.HoodieHiveClient.getPartitionEvents(HoodieHiveClient.java:220)
>   at 
> org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:221)
>   ... 14 more
> {code}
> To fix this issue we need to create a new partition extractor class and 
> assign that class name as the hive sync partition extractor.
> After you define the new partition extractor class, you can configure it like 
> this:
> {code:java}
> hoodie.datasource.hive_sync.enable=true
> hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1762) Hive Sync is not working with Hive Style Partitioning

2021-04-04 Thread Vinoth Govindarajan (Jira)
Vinoth Govindarajan created HUDI-1762:
-

 Summary: Hive Sync is not working with Hive Style Partitioning
 Key: HUDI-1762
 URL: https://issues.apache.org/jira/browse/HUDI-1762
 Project: Apache Hudi
  Issue Type: Bug
  Components: Hive Integration
Reporter: Vinoth Govindarajan
Assignee: Vinoth Govindarajan


When you create a Hudi table with hive style partitioning and enable hive 
sync, the sync does not work because it assumes the partition values are 
separated by slashes.

 

When hive style partitioning is enabled for the target table like this:
{code:java}
hoodie.datasource.write.partitionpath.field=datestr
hoodie.datasource.write.hive_style_partitioning=true
{code}
This is the error it throws:
{code:java}
21/04/01 23:10:33 ERROR deltastreamer.HoodieDeltaStreamer: Got error running 
delta sync once. Shutting down
org.apache.hudi.exception.HoodieException: Got runtime exception when hive 
syncing delta_streamer_test
at 
org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:122)
at 
org.apache.hudi.utilities.deltastreamer.DeltaSync.syncMeta(DeltaSync.java:560)
at 
org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:475)
at 
org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:282)
at 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:170)
at org.apache.hudi.common.util.Option.ifPresent(Option.java:96)
at 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:168)
at 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:470)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:690)
Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Failed to sync 
partitions for table fact_scheduled_trip__1pc_trip_uuid
at 
org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:229)
at 
org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:166)
at 
org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:108)
... 12 more
Caused by: java.lang.IllegalArgumentException: Partition path 
datestr=2021-03-28 is not in the form yyyy/mm/dd 
at 
org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor.extractPartitionValuesInPath(SlashEncodedDayPartitionValueExtractor.java:55)
at 
org.apache.hudi.hive.HoodieHiveClient.getPartitionEvents(HoodieHiveClient.java:220)
at 
org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:221)
... 14 more
{code}
To fix this issue, we need to create a new partition extractor class and assign 
that class name as the hive sync partition extractor (a sketch follows after the 
configuration example below).

After you define the new partition extractor class, you can configure it like 
this:
{code:java}
hoodie.datasource.hive_sync.enable=true
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor
{code}
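
For illustration, here is a minimal sketch of what such an extractor could look 
like. It assumes the existing PartitionValueExtractor interface that 
SlashEncodedDayPartitionValueExtractor implements; the multi-level handling and 
the error message are illustrative only, not the final implementation:
{code:java}
package org.apache.hudi.hive;

import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of a value extractor for hive style partition paths such as
 * "datestr=2021-03-28" or "year=2021/month=03/day=28": split each level on
 * "=" and keep only the value part.
 */
public class HiveStylePartitionValueExtractor implements PartitionValueExtractor {

  @Override
  public List<String> extractPartitionValuesInPath(String partitionPath) {
    List<String> partitionValues = new ArrayList<>();
    for (String level : partitionPath.split("/")) {
      String[] keyValue = level.split("=", 2);
      if (keyValue.length != 2) {
        throw new IllegalArgumentException(
            "Partition path " + partitionPath + " is not in hive style (key=value) form");
      }
      partitionValues.add(keyValue[1]);
    }
    return partitionValues;
  }
}
{code}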



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1747) Deltastreamer incremental read is not working on the MOR table

2021-03-31 Thread Vinoth Govindarajan (Jira)
Vinoth Govindarajan created HUDI-1747:
-

 Summary: Deltastreamer incremental read is not working on the MOR 
table
 Key: HUDI-1747
 URL: https://issues.apache.org/jira/browse/HUDI-1747
 Project: Apache Hudi
  Issue Type: Bug
  Components: Common Core
Reporter: Vinoth Govindarajan


While trying to read a MOR Hudi table incrementally using deltastreamer, I ran 
into this issue where it says:
{code:java}
Found recursive reference in Avro schema, which can not be processed by 
Spark:{code}
Spark Version: 2.4

Hudi Version: 0.7.0-SNAPSHOT or the latest master
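
For context, a HoodieIncrSource is typically wired up with properties along 
these lines (the table path is a placeholder, and apart from 
read_latest_on_missing_ckpt, which is quoted in HUDI-1695, the property names 
are assumptions that should be checked against the HoodieIncrSource config 
class):
{code:java}
# Source class passed to deltastreamer: org.apache.hudi.utilities.sources.HoodieIncrSource
# Base path of the upstream MOR Hudi table to pull from incrementally (placeholder value)
hoodie.deltastreamer.source.hoodieincr.path=/data/source_mor_table
# How many commits to pull per run, and behavior when no checkpoint exists yet
hoodie.deltastreamer.source.hoodieincr.num_instants=5
hoodie.deltastreamer.source.hoodieincr.read_latest_on_missing_ckpt=true
{code}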

 

Full Stack Trace:
{code:java}
Found recursive reference in Avro schema, which can not be processed by Spark:
{
  "type" : "record",
  "name" : "meta",
  "fields" : [ {
"name" : "verified",
"type" : [ "null", "boolean" ],
"default" : null
  }, {
"name" : "zip",
"type" : [ "null", "string" ],
"default" : null
  }, {
"name" : "lname",
"type" : [ "null", "string" ],
"default" : null
  }]
}
  
at 
org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:75)
at 
org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:105)
at 
org.apache.spark.sql.avro.SchemaConverters$$anonfun$1.apply(SchemaConverters.scala:82)
at 
org.apache.spark.sql.avro.SchemaConverters$$anonfun$1.apply(SchemaConverters.scala:81)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at 
org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:81)
at 
org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:105)
at 
org.apache.spark.sql.avro.SchemaConverters$$anonfun$1.apply(SchemaConverters.scala:82)
at 
org.apache.spark.sql.avro.SchemaConverters$$anonfun$1.apply(SchemaConverters.scala:81)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at 
org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:81)
at 
org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:95)
at 
org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:105)
at 
org.apache.spark.sql.avro.SchemaConverters$$anonfun$1.apply(SchemaConverters.scala:82)
at 
org.apache.spark.sql.avro.SchemaConverters$$anonfun$1.apply(SchemaConverters.scala:81)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at 
org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:81)
at 
org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:105)
at 
org.apache.spark.sql.avro.SchemaConverters$$anonfun$1.apply(SchemaConverters.scala:82)
at 
org.apache.spark.sql.avro.SchemaConverters$$anonfun$1.apply(SchemaConverters.scala:81)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 

[jira] [Created] (HUDI-1743) Add support for Spark SQL File based transformer for deltastreamer

2021-03-30 Thread Vinoth Govindarajan (Jira)
Vinoth Govindarajan created HUDI-1743:
-

 Summary: Add support for Spark SQL File based transformer for 
deltastreamer
 Key: HUDI-1743
 URL: https://issues.apache.org/jira/browse/HUDI-1743
 Project: Apache Hudi
  Issue Type: Improvement
  Components: DeltaStreamer
Reporter: Vinoth Govindarajan
Assignee: Vinoth Govindarajan


The current SQL query based transformer is limited in functionality: you can't 
pass multiple Spark SQL statements separated by semicolons, which is necessary 
if your transformation is complex.

 

The ask is to add a new SQLFileBasedTransformer that takes a Spark SQL file 
containing multiple Spark SQL statements as input and applies those 
transformations to the deltastreamer payload.
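
A minimal sketch of what such a transformer could look like, assuming the 
existing org.apache.hudi.utilities.transform.Transformer interface; the 
property name, the temp view name, and the local-file reading are illustrative 
only, and import paths (for example TypedProperties) may differ between Hudi 
versions:
{code:java}
package org.apache.hudi.utilities.transform;

import org.apache.hudi.common.config.TypedProperties;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

/**
 * Sketch of a transformer that exposes the incoming batch as a temp view,
 * runs every semicolon-separated statement from a SQL file, and returns the
 * result of the last statement as the transformed dataset.
 */
public class SqlFileBasedTransformer implements Transformer {

  // Hypothetical property holding the path of the SQL file to apply.
  private static final String SQL_FILE_PROP = "hoodie.deltastreamer.transformer.sql.file";
  // Hypothetical name of the temp view the SQL statements can select from.
  private static final String SRC_VIEW = "HUDI_SRC_TBL";

  @Override
  public Dataset<Row> apply(JavaSparkContext jsc, SparkSession sparkSession,
                            Dataset<Row> rowDataset, TypedProperties properties) {
    try {
      // Make the incoming deltastreamer batch visible to the SQL statements.
      rowDataset.createOrReplaceTempView(SRC_VIEW);
      // This sketch reads the SQL file from the local filesystem for simplicity;
      // a real implementation would go through the Hadoop FileSystem API.
      String sqlFile = properties.getString(SQL_FILE_PROP);
      String sqlText = new String(Files.readAllBytes(Paths.get(sqlFile)), StandardCharsets.UTF_8);
      Dataset<Row> transformed = rowDataset;
      // Run each semicolon-separated statement; the last result becomes the payload.
      for (String stmt : sqlText.split(";")) {
        if (!stmt.trim().isEmpty()) {
          transformed = sparkSession.sql(stmt);
        }
      }
      return transformed;
    } catch (Exception e) {
      throw new RuntimeException("Failed to apply SQL file based transformation", e);
    }
  }
}
{code}
The SQL file itself would then hold several semicolon-separated statements, for 
example a few CREATE TEMPORARY VIEW statements that build intermediate results 
from the temp view above, followed by a final SELECT whose output becomes the 
deltastreamer payload.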



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-783) Add official python support to create hudi datasets using pyspark

2021-03-30 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan closed HUDI-783.


> Add official python support to create hudi datasets using pyspark
> -
>
> Key: HUDI-783
> URL: https://issues.apache.org/jira/browse/HUDI-783
> Project: Apache Hudi
>  Issue Type: Wish
>  Components: Utilities
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Major
>  Labels: features, pull-request-available
> Fix For: 0.6.0
>
>
> *Goal:*
>  As a pyspark user, I would like to read/write hudi datasets using pyspark.
> There are several components to achieve this goal.
>  # Create a hudi-pyspark package that users can import and start 
> reading/writing hudi datasets.
>  # Explain how to read/write hudi datasets using pyspark in a blog 
> post/documentation.
>  # Add the hudi-pyspark module to the hudi demo docker along with the 
> instructions.
>  # Make the package available as part of the [spark packages 
> index|https://spark-packages.org/] and [python package 
> index|https://pypi.org/]
> The hudi-pyspark package should implement the Hudi data source API for Apache 
> Spark, using which Hudi files can be read as DataFrames and written to any 
> Hadoop supported file system.
> Usage pattern after we launch this feature should be something like this:
> Install the package using:
> {code:java}
> pip install hudi-pyspark{code}
> or
> Include hudi-pyspark package in your Spark Applications using:
> spark-shell, pyspark, or spark-submit
> {code:java}
> > $SPARK_HOME/bin/spark-shell --packages 
> > org.apache.hudi:hudi-pyspark_2.11:0.5.2{code}
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-1695) Deltastreamer HoodieIncrSource exception error messaging is incorrect

2021-03-16 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan resolved HUDI-1695.
---
Resolution: Fixed

> Deltastreamer HoodieIncrSource exception error messaging is incorrect
> -
>
> Key: HUDI-1695
> URL: https://issues.apache.org/jira/browse/HUDI-1695
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Trivial
>  Labels: beginner, pull-request-available
> Fix For: 0.8.0
>
>
> When you set your source_class as HoodieIncrSource and invoke deltastreamer 
> without any checkpoint, it throws the following Exception:
>  
> {code:java}
> User class threw exception: java.lang.IllegalArgumentException: Missing begin 
> instant for incremental pull. For reading from latest committed instant set 
> hoodie.deltastreamer.source.hoodie.read_latest_on_midding_ckpt to true{code}
>  
> The error messaging is wrong and misleading; the correct parameter is:
> {code:java}
> hoodie.deltastreamer.source.hoodieincr.read_latest_on_missing_ckpt
> {code}
> Check out the correct parameter in this 
> [file|https://github.com/apache/hudi/blob/release-0.7.0/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/HoodieIncrSource.java#L78]
>  
> The correct messaging should be:
> {code:java}
> User class threw exception: java.lang.IllegalArgumentException: Missing begin 
> instant for incremental pull. For reading from latest committed instant set 
> hoodie.deltastreamer.source.hoodieincr.read_latest_on_missing_ckpt to true
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-1695) Deltastreamer HoodieIncrSource exception error messaging is incorrect

2021-03-16 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan closed HUDI-1695.
-

> Deltastreamer HoodieIncrSource exception error messaging is incorrect
> -
>
> Key: HUDI-1695
> URL: https://issues.apache.org/jira/browse/HUDI-1695
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Trivial
>  Labels: beginner, pull-request-available
> Fix For: 0.8.0
>
>
> When you set your source_class as HoodieIncrSource and invoke deltastreamer 
> without any checkpoint, it throws the following Exception:
>  
> {code:java}
> User class threw exception: java.lang.IllegalArgumentException: Missing begin 
> instant for incremental pull. For reading from latest committed instant set 
> hoodie.deltastreamer.source.hoodie.read_latest_on_midding_ckpt to true{code}
>  
> The error messaging is wrong and misleading; the correct parameter is:
> {code:java}
> hoodie.deltastreamer.source.hoodieincr.read_latest_on_missing_ckpt
> {code}
> Check out the correct parameter in this 
> [file|https://github.com/apache/hudi/blob/release-0.7.0/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/HoodieIncrSource.java#L78]
>  
> The correct messaging should be:
> {code:java}
> User class threw exception: java.lang.IllegalArgumentException: Missing begin 
> instant for incremental pull. For reading from latest committed instant set 
> hoodie.deltastreamer.source.hoodieincr.read_latest_on_missing_ckpt to true
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1695) Deltastreamer HoodieIncrSource exception error messaging is incorrect

2021-03-15 Thread Vinoth Govindarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302235#comment-17302235
 ] 

Vinoth Govindarajan commented on HUDI-1695:
---

PR has been merged.

> Deltastreamer HoodieIncrSource exception error messaging is incorrect
> -
>
> Key: HUDI-1695
> URL: https://issues.apache.org/jira/browse/HUDI-1695
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Trivial
>  Labels: beginner, pull-request-available
> Fix For: 0.8.0
>
>
> When you set your source_class as HoodieIncrSource and invoke deltastreamer 
> without any checkpoint, it throws the following Exception:
>  
> {code:java}
> User class threw exception: java.lang.IllegalArgumentException: Missing begin 
> instant for incremental pull. For reading from latest committed instant set 
> hoodie.deltastreamer.source.hoodie.read_latest_on_midding_ckpt to true{code}
>  
> The error messaging is wrong and misleading; the correct parameter is:
> {code:java}
> hoodie.deltastreamer.source.hoodieincr.read_latest_on_missing_ckpt
> {code}
> Check out the correct parameter in this 
> [file|https://github.com/apache/hudi/blob/release-0.7.0/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/HoodieIncrSource.java#L78]
>  
> The correct messaging should be:
> {code:java}
> User class threw exception: java.lang.IllegalArgumentException: Missing begin 
> instant for incremental pull. For reading from latest committed instant set 
> hoodie.deltastreamer.source.hoodieincr.read_latest_on_missing_ckpt to true
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1695) Deltastreamer HoodieIncrSource exception error messaging is incorrect

2021-03-15 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan updated HUDI-1695:
--
Summary: Deltastreamer HoodieIncrSource exception error messaging is 
incorrect  (was: Deltastream HoodieIncrSource exception error messaging is 
incorrect)

> Deltastreamer HoodieIncrSource exception error messaging is incorrect
> -
>
> Key: HUDI-1695
> URL: https://issues.apache.org/jira/browse/HUDI-1695
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Trivial
>  Labels: beginner
> Fix For: 0.8.0
>
>
> When you set your source_class as HoodieIncrSource and invoke deltastreamer 
> without any checkpoint, it throws the following Exception:
>  
> {code:java}
> User class threw exception: java.lang.IllegalArgumentException: Missing begin 
> instant for incremental pull. For reading from latest committed instant set 
> hoodie.deltastreamer.source.hoodie.read_latest_on_midding_ckpt to true{code}
>  
> The error messaging is wrong and misleading; the correct parameter is:
> {code:java}
> hoodie.deltastreamer.source.hoodieincr.read_latest_on_missing_ckpt
> {code}
> Check out the correct parameter in this 
> [file|https://github.com/apache/hudi/blob/release-0.7.0/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/HoodieIncrSource.java#L78]
>  
> The correct messaging should be:
> {code:java}
> User class threw exception: java.lang.IllegalArgumentException: Missing begin 
> instant for incremental pull. For reading from latest committed instant set 
> hoodie.deltastreamer.source.hoodieincr.read_latest_on_missing_ckpt to true
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1695) Deltastream HoodieIncrSource exception error messaging is incorrect

2021-03-15 Thread Vinoth Govindarajan (Jira)
Vinoth Govindarajan created HUDI-1695:
-

 Summary: Deltastream HoodieIncrSource exception error messaging is 
incorrect
 Key: HUDI-1695
 URL: https://issues.apache.org/jira/browse/HUDI-1695
 Project: Apache Hudi
  Issue Type: Bug
  Components: DeltaStreamer
Reporter: Vinoth Govindarajan
Assignee: Vinoth Govindarajan
 Fix For: 0.8.0


When you set your source_class as HoodieIncrSource and invoke deltastreamer 
without any checkpoint, it throws the following Exception:

 
{code:java}
User class threw exception: java.lang.IllegalArgumentException: Missing begin 
instant for incremental pull. For reading from latest committed instant set 
hoodie.deltastreamer.source.hoodie.read_latest_on_midding_ckpt to true{code}
 

The error messaging is wrong and misleading; the correct parameter is:
{code:java}
hoodie.deltastreamer.source.hoodieincr.read_latest_on_missing_ckpt
{code}
Check out the correct parameter in this 
[file|https://github.com/apache/hudi/blob/release-0.7.0/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/HoodieIncrSource.java#L78]

 

The correct messaging should be:
{code:java}
User class threw exception: java.lang.IllegalArgumentException: Missing begin 
instant for incremental pull. For reading from latest committed instant set 
hoodie.deltastreamer.source.hoodieincr.read_latest_on_missing_ckpt to true
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-825) Write a small blog on how to use hudi-spark with pyspark

2021-01-26 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan updated HUDI-825:
-
Status: Open  (was: New)

> Write a small blog on how to use hudi-spark with pyspark
> 
>
> Key: HUDI-825
> URL: https://issues.apache.org/jira/browse/HUDI-825
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Docs
>Reporter: Nishith Agarwal
>Assignee: Vinoth Govindarajan
>Priority: Major
>  Labels: user-support-issues
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-824) Register hudi-spark package with spark packages repo for easier usage of Hudi

2021-01-26 Thread Vinoth Govindarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17272397#comment-17272397
 ] 

Vinoth Govindarajan commented on HUDI-824:
--

[~nagarwal] - All the Apache projects are available directly to use with the 
`--packages` option; I tried it with pyspark and it worked:


{code:java}
spark-shell \
  --packages 
org.apache.hudi:hudi-spark-bundle_2.12:0.7.0,org.apache.spark:spark-avro_2.12:3.0.1
 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
{code}
 

The same instructions have been updated in the following doc:
[https://hudi.apache.org/docs/quick-start-guide.html]

 

No further action needed; let me know if it's okay to close this issue.

 

> Register hudi-spark package with spark packages repo for easier usage of Hudi
> -
>
> Key: HUDI-824
> URL: https://issues.apache.org/jira/browse/HUDI-824
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: Nishith Agarwal
>Assignee: Vinoth Govindarajan
>Priority: Minor
>  Labels: user-support-issues
>
> At the moment, to be able to use Hudi with spark, users have to do the 
> following: 
>  
> {{spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
>   --jars `ls 
> packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-*.*.*-SNAPSHOT.jar` 
> \
>   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'}}
> {{Ideally, we want to be able to use Hudi as follows:}}
>  
> {{spark-2.4.4-bin-hadoop2.7/bin/spark-shell \ --packages 
> org.apache.hudi:hudi-spark-bundle: \
>   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-824) Register hudi-spark package with spark packages repo for easier usage of Hudi

2021-01-26 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan updated HUDI-824:
-
Status: Open  (was: New)

> Register hudi-spark package with spark packages repo for easier usage of Hudi
> -
>
> Key: HUDI-824
> URL: https://issues.apache.org/jira/browse/HUDI-824
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: Nishith Agarwal
>Assignee: Vinoth Govindarajan
>Priority: Minor
>  Labels: user-support-issues
>
> At the moment, to be able to use Hudi with spark, users have to do the 
> following: 
>  
> {{spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
>   --jars `ls 
> packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-*.*.*-SNAPSHOT.jar` 
> \
>   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'}}
> {{Ideally, we want to be able to use Hudi as follows:}}
>  
> {{spark-2.4.4-bin-hadoop2.7/bin/spark-shell \ --packages 
> org.apache.hudi:hudi-spark-bundle: \
>   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-783) Add official python support to create hudi datasets using pyspark

2020-06-10 Thread Vinoth Govindarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17130943#comment-17130943
 ] 

Vinoth Govindarajan commented on HUDI-783:
--

Status update:
 * -Explain how to read/write hudi datasets using pyspark in a blog 
post/documentation.- Done
 ** Here is the documentation: 
[https://hudi.apache.org/docs/quick-start-guide.html#pyspark-example]
 ** There is a separate ticket for blog post - HUDI-825
 * -Add the hudi-pyspark module to the hudi demo docker along with the 
instructions.- Done
 ** pyspark is now supported in the hudi demo docker; here is the 
[PR|https://github.com/apache/hudi/pull/1632]
 * -Make the package available as part of the [spark packages 
index|https://spark-packages.org/] and [python package 
index|https://pypi.org/]- Done
 ** Since hudi is already an Apache project, we can use it directly as a 
package:
 ** 
{code:java}
export PYSPARK_PYTHON=$(which python3)
spark-2.4.4-bin-hadoop2.7/bin/pyspark \
  --packages 
org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
{code}

> Add official python support to create hudi datasets using pyspark
> -
>
> Key: HUDI-783
> URL: https://issues.apache.org/jira/browse/HUDI-783
> Project: Apache Hudi
>  Issue Type: Wish
>  Components: Utilities
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Major
>  Labels: features, pull-request-available
> Fix For: 0.6.0
>
>
> *Goal:*
>  As a pyspark user, I would like to read/write hudi datasets using pyspark.
> There are several components to achieve this goal.
>  # Create a hudi-pyspark package that users can import and start 
> reading/writing hudi datasets.
>  # Explain how to read/write hudi datasets using pyspark in a blog 
> post/documentation.
>  # Add the hudi-pyspark module to the hudi demo docker along with the 
> instructions.
>  # Make the package available as part of the [spark packages 
> index|https://spark-packages.org/] and [python package 
> index|https://pypi.org/]
> The hudi-pyspark package should implement the Hudi data source API for Apache 
> Spark, using which Hudi files can be read as DataFrames and written to any 
> Hadoop supported file system.
> Usage pattern after we launch this feature should be something like this:
> Install the package using:
> {code:java}
> pip install hudi-pyspark{code}
> or
> Include hudi-pyspark package in your Spark Applications using:
> spark-shell, pyspark, or spark-submit
> {code:java}
> > $SPARK_HOME/bin/spark-shell --packages 
> > org.apache.hudi:hudi-pyspark_2.11:0.5.2{code}
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-783) Add official python support to create hudi datasets using pyspark

2020-04-16 Thread Vinoth Govindarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17085253#comment-17085253
 ] 

Vinoth Govindarajan commented on HUDI-783:
--

Thanks, [~vinoth]!

We don't need to write any wrapper code in Python to use it in pyspark; the 
existing jar files can be packaged and used in pyspark through the data source 
API.

To improve the client experience, we need to register the hudi-spark bundle as 
a package on the [https://spark-packages.org/register] website. 
[spark-packages.org|https://spark-packages.org/] is an external, 
community-managed list of third-party libraries, add-ons, and applications that 
work with Apache Spark. 

To register a package, its content must be hosted by 
[GitHub|https://github.com/] in a public repo under the owner's account; I 
guess you can register the package since you are the owner of the repo. 
Thoughts?

> Add official python support to create hudi datasets using pyspark
> -
>
> Key: HUDI-783
> URL: https://issues.apache.org/jira/browse/HUDI-783
> Project: Apache Hudi (incubating)
>  Issue Type: Wish
>  Components: Utilities
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Major
>  Labels: features
> Fix For: 0.6.0
>
>
> *Goal:*
>  As a pyspark user, I would like to read/write hudi datasets using pyspark.
> There are several components to achieve this goal.
>  # Create a hudi-pyspark package that users can import and start 
> reading/writing hudi datasets.
>  # Explain how to read/write hudi datasets using pyspark in a blog 
> post/documentation.
>  # Add the hudi-pyspark module to the hudi demo docker along with the 
> instructions.
>  # Make the package available as part of the [spark packages 
> index|https://spark-packages.org/] and [python package 
> index|https://pypi.org/]
> The hudi-pyspark package should implement the Hudi data source API for Apache 
> Spark, using which Hudi files can be read as DataFrames and written to any 
> Hadoop supported file system.
> Usage pattern after we launch this feature should be something like this:
> Install the package using:
> {code:java}
> pip install hudi-pyspark{code}
> or
> Include hudi-pyspark package in your Spark Applications using:
> spark-shell, pyspark, or spark-submit
> {code:java}
> > $SPARK_HOME/bin/spark-shell --packages 
> > org.apache.hudi:hudi-pyspark_2.11:0.5.2{code}
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

