[jira] [Created] (HUDI-5826) Add docs for how to use Hudi CLI on GCP
Pramod Biligiri created HUDI-5826: - Summary: Add docs for how to use Hudi CLI on GCP Key: HUDI-5826 URL: https://issues.apache.org/jira/browse/HUDI-5826 Project: Apache Hudi Issue Type: Improvement Components: docs Reporter: Pramod Biligiri If a user wants to set up and run Hudi CLI on a GCP Dataproc node, currently there is no clear documentation for the same. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-5826) Add docs for how to use Hudi CLI on GCP Dataproc
[ https://issues.apache.org/jira/browse/HUDI-5826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pramod Biligiri updated HUDI-5826:
Summary: Add docs for how to use Hudi CLI on GCP Dataproc (was: Add docs for how to use Hudi CLI on GCP)

> Add docs for how to use Hudi CLI on GCP Dataproc
>
> Key: HUDI-5826
> URL: https://issues.apache.org/jira/browse/HUDI-5826
> Project: Apache Hudi
> Issue Type: Improvement
> Components: docs
> Reporter: Pramod Biligiri
> Priority: Major
> Labels: documentation
>
> If a user wants to set up and run Hudi CLI on a GCP Dataproc node, currently there is no clear documentation for the same.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-5806) hudi-cli should have option to show nearest matching commit
Pramod Biligiri created HUDI-5806: - Summary: hudi-cli should have option to show nearest matching commit Key: HUDI-5806 URL: https://issues.apache.org/jira/browse/HUDI-5806 Project: Apache Hudi Issue Type: Improvement Components: cli Reporter: Pramod Biligiri When searching for a commit timestamp in hudi cli, there should be an option to display the nearest matching commits if no exact match is found. This will help in production support use cases to quickly know what was the recent commit activity in the period in which the user is interested in. -- This message was sent by Atlassian Jira (v8.20.10#820010)
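One way the requested lookup could work is sketched below. This is an illustration only, not hudi-cli code: the class and method names are hypothetical, and it assumes commit timestamps are fixed-width strings, so that lexicographic order matches chronological order. When there is no exact match, it returns the closest earlier and closest later commits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper, not part of hudi-cli.
final class NearestCommitFinder {

    // Returns the exact match if present; otherwise the closest earlier and
    // closest later commit times (whichever of the two exist).
    // Assumes all timestamps have the same width, so string order is time order.
    static List<String> nearest(TreeSet<String> commitTimes, String requested) {
        List<String> result = new ArrayList<>();
        if (commitTimes.contains(requested)) {
            result.add(requested);
            return result;
        }
        String earlier = commitTimes.lower(requested);
        String later = commitTimes.higher(requested);
        if (earlier != null) {
            result.add(earlier);
        }
        if (later != null) {
            result.add(later);
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample timestamps invented for the demo.
        TreeSet<String> commits = new TreeSet<>(
            List.of("20220907220948700", "20220908101500000", "20220909093000000"));
        System.out.println(nearest(commits, "20220908000000000"));
    }
}
```

A sorted set keeps both neighbour lookups logarithmic, which may matter for tables with long timelines.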
[jira] [Updated] (HUDI-5804) hudi-cli CommitsCommand - some options fail due to typo in ShellOption annotation
[ https://issues.apache.org/jira/browse/HUDI-5804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pramod Biligiri updated HUDI-5804:
Description:
In multiple places in CommitsCommand, the ShellOption annotation is missing the "--" prefix in its value attribute. One such example, from "commit showpartitions", is shown below:
[https://github.com/apache/hudi/blob/master/hudi-cli/src/main/java/org/apache/hudi/cli/commands/CommitsCommand.java#L213]
|@ShellOption(value = {"includeArchivedTimeline"}, help = "Include archived commits as well", defaultValue = "false") final boolean includeArchivedTimeline)|
In the above, it should read 'value={"--includeArchivedTimeline"...}'

was:
In multiple places in CommitsCommand, the ShellOption annotation is missing the "--" prefix in its value attribute. One such example, from "commit showpartitions", is shown below:
[https://github.com/apache/hudi/blob/master/hudi-cli/src/main/java/org/apache/hudi/cli/commands/CommitsCommand.java#L213]
|@ShellOption(value = {"includeArchivedTimeline"}, help = "Include archived commits as well", defaultValue = "false") final boolean includeArchivedTimeline)|
That should read value={"--includeArchivedTimeline"}

> hudi-cli CommitsCommand - some options fail due to typo in ShellOption annotation
>
> Key: HUDI-5804
> URL: https://issues.apache.org/jira/browse/HUDI-5804
> Project: Apache Hudi
> Issue Type: Bug
> Components: cli
> Reporter: Pramod Biligiri
> Priority: Minor
>
> In multiple places in CommitsCommand, the ShellOption annotation is missing the "--" prefix in its value attribute. One such example, from "commit showpartitions", is shown below:
> [https://github.com/apache/hudi/blob/master/hudi-cli/src/main/java/org/apache/hudi/cli/commands/CommitsCommand.java#L213]
> |@ShellOption(value = {"includeArchivedTimeline"}, help = "Include archived commits as well", defaultValue = "false") final boolean includeArchivedTimeline)|
> In the above, it should read 'value={"--includeArchivedTimeline"...}'

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-5804) hudi-cli CommitsCommand - some options fail due to typo in ShellOption annotation
Pramod Biligiri created HUDI-5804:

Summary: hudi-cli CommitsCommand - some options fail due to typo in ShellOption annotation
Key: HUDI-5804
URL: https://issues.apache.org/jira/browse/HUDI-5804
Project: Apache Hudi
Issue Type: Bug
Components: cli
Reporter: Pramod Biligiri

In multiple places in CommitsCommand, the ShellOption annotation is missing the "--" prefix in its value attribute. One such example, from "commit showpartitions", is shown below:
[https://github.com/apache/hudi/blob/master/hudi-cli/src/main/java/org/apache/hudi/cli/commands/CommitsCommand.java#L213]
|@ShellOption(value = {"includeArchivedTimeline"}, help = "Include archived commits as well", defaultValue = "false") final boolean includeArchivedTimeline)|
That should read value={"--includeArchivedTimeline"}

-- This message was sent by Atlassian Jira (v8.20.10#820010)
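Since the issue reports the same slip in several places, a reflection-based check is one possible way to find every occurrence. The sketch below is an assumption-laden illustration: `Opt` is a local stand-in declared here so the example compiles on its own (it is not Spring Shell's `@ShellOption`), and `DemoCommands` is a made-up class that mirrors the two spellings discussed in the issue.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.lang.reflect.Parameter;
import java.util.ArrayList;
import java.util.List;

final class OptionPrefixCheck {

    // Local stand-in for an option annotation; declared only to keep the sketch self-contained.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.PARAMETER)
    @interface Opt {
        String[] value();
    }

    // Hypothetical command class: one option without the "--" prefix, one with it.
    static final class DemoCommands {
        void showPartitions(@Opt({"includeArchivedTimeline"}) boolean includeArchived) { }

        void showFiles(@Opt({"--includeArchivedTimeline"}) boolean includeArchived) { }
    }

    // Collects "method: option" entries for option names that lack the "--" prefix.
    static List<String> missingPrefix(Class<?> commandClass) {
        List<String> problems = new ArrayList<>();
        for (Method method : commandClass.getDeclaredMethods()) {
            for (Parameter parameter : method.getParameters()) {
                Opt opt = parameter.getAnnotation(Opt.class);
                if (opt == null) {
                    continue;
                }
                for (String name : opt.value()) {
                    if (!name.startsWith("--")) {
                        problems.add(method.getName() + ": " + name);
                    }
                }
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        System.out.println(missingPrefix(DemoCommands.class));
    }
}
```

A check like this could run as a unit test over the real command classes, with the stand-in annotation swapped for the one they actually use.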
[jira] [Created] (HUDI-5719) Add docs for hudi-cli "show restores" feature
Pramod Biligiri created HUDI-5719: - Summary: Add docs for hudi-cli "show restores" feature Key: HUDI-5719 URL: https://issues.apache.org/jira/browse/HUDI-5719 Project: Apache Hudi Issue Type: Task Components: docs Reporter: Pramod Biligiri Once the hudi-cli "show restores" feature is accepted (https://issues.apache.org/jira/browse/HUDI-1593 is considered done), add documentation for the same to the website and wherever else required. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-5688) schema field of EmptyRelation subtype of BaseRelation should not be null
[ https://issues.apache.org/jira/browse/HUDI-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17684486#comment-17684486 ]

Pramod Biligiri commented on HUDI-5688:

A small workaround for the null value, which shows that the bug diagnosis is valid: [https://github.com/apache/hudi/pull/7864]
I am not sure whether the above change can be considered a fix for the issue.

> schema field of EmptyRelation subtype of BaseRelation should not be null
>
> Key: HUDI-5688
> URL: https://issues.apache.org/jira/browse/HUDI-5688
> Project: Apache Hudi
> Issue Type: Bug
> Components: core
> Reporter: Pramod Biligiri
> Priority: Major
> Labels: pull-request-available
> Attachments: 1-userSpecifiedSchema-is-null.png, 2-empty-relation.png, 3-table-schema-will-not-resolve.png, 4-resolve-schema-returns-null.png, Main.java, pom.xml
>
> If there are no completed instants in the table, and there is no user-defined schema for it either (as represented by the userSpecifiedSchema field in DataSource.scala), then DefaultSource.createRelation returns an EmptyRelation whose schema is null. This breaks the contract of Spark's BaseRelation, where the schema is a StructType that is not expected to be null.
> Module versions: current apache-hudi master (commit hash abe26d4169c04da05b99941161621876e3569e96) built with spark3.2 and scala-2.12.
> The following Hudi session reproduces the above issue:
> spark.read.format("hudi")
>   .option("hoodie.datasource.query.type", "incremental")
>   .load("SOME_HUDI_TABLE_WITH_NO_COMPLETED_INSTANTS_OR_USER_SPECIFIED_SCHEMA")
> java.lang.NullPointerException
>   at org.apache.spark.sql.catalyst.util.CharVarcharUtils$.replaceCharVarcharWithStringInSchema(CharVarcharUtils.scala:41)
>   at org.apache.spark.sql.execution.datasources.LogicalRelation$.apply(LogicalRelation.scala:76)
>   at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:440)
>   at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274)
>   at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:188)
>   ... 50 elided
> Find attached a few screenshots which show the code flow and the buggy state of the variables. Also find attached a Java file and pom.xml that can be used to reproduce the same (sorry, I don't have an anonymized table to share yet).
> The bug seems to have been introduced in this particular PR change:
> [https://github.com/apache/hudi/pull/6727/files#diff-4cfd87bb9200170194a633746094de138c3a0e3976d351d0d911ee95651256acR220]
> Initial work on that file happened in this Jira (https://issues.apache.org/jira/browse/HUDI-4363) and this PR (https://github.com/apache/hudi/pull/6046).

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-5688) schema field of EmptyRelation subtype of BaseRelation should not be null
Pramod Biligiri created HUDI-5688:

Summary: schema field of EmptyRelation subtype of BaseRelation should not be null
Key: HUDI-5688
URL: https://issues.apache.org/jira/browse/HUDI-5688
Project: Apache Hudi
Issue Type: Bug
Components: core
Reporter: Pramod Biligiri
Attachments: 1-userSpecifiedSchema-is-null.png, 2-empty-relation.png, 3-table-schema-will-not-resolve.png, 4-resolve-schema-returns-null.png, Main.java, pom.xml

If there are no completed instants in the table, and there is no user-defined schema for it either (as represented by the userSpecifiedSchema field in DataSource.scala), then DefaultSource.createRelation returns an EmptyRelation whose schema is null. This breaks the contract of Spark's BaseRelation, where the schema is a StructType that is not expected to be null.

Module versions: current apache-hudi master (commit hash abe26d4169c04da05b99941161621876e3569e96) built with spark3.2 and scala-2.12.

The following Hudi session reproduces the above issue:

spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .load("SOME_HUDI_TABLE_WITH_NO_COMPLETED_INSTANTS_OR_USER_SPECIFIED_SCHEMA")
java.lang.NullPointerException
  at org.apache.spark.sql.catalyst.util.CharVarcharUtils$.replaceCharVarcharWithStringInSchema(CharVarcharUtils.scala:41)
  at org.apache.spark.sql.execution.datasources.LogicalRelation$.apply(LogicalRelation.scala:76)
  at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:440)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274)
  at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:188)
  ... 50 elided

Find attached a few screenshots which show the code flow and the buggy state of the variables. Also find attached a Java file and pom.xml that can be used to reproduce the same (sorry, I don't have an anonymized table to share yet).

The bug seems to have been introduced in this particular PR change:
[https://github.com/apache/hudi/pull/6727/files#diff-4cfd87bb9200170194a633746094de138c3a0e3976d351d0d911ee95651256acR220]
Initial work on that file happened in this Jira (https://issues.apache.org/jira/browse/HUDI-4363) and this PR (https://github.com/apache/hudi/pull/6046).

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-5650) Add source data estimators to optimize ingestion runs
Pramod Biligiri created HUDI-5650: - Summary: Add source data estimators to optimize ingestion runs Key: HUDI-5650 URL: https://issues.apache.org/jira/browse/HUDI-5650 Project: Apache Hudi Issue Type: New Feature Components: deltastreamer Reporter: Pramod Biligiri Estimate how much new data is present to be ingested from a given data source, and schedule DeltaStreamer jobs based on that. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-5024) Support storing database also as a Dataset in Datahub, not just a table
Pramod Biligiri created HUDI-5024: - Summary: Support storing database also as a Dataset in Datahub, not just a table Key: HUDI-5024 URL: https://issues.apache.org/jira/browse/HUDI-5024 Project: Apache Hudi Issue Type: Task Components: meta-sync Reporter: Pramod Biligiri Note: Evaluate feasibility and desirability of this before implementing. Hudi's DatahubSyncTool only pushes tables as a Dataset into Datahub, and not the database itself as a Dataset. Moreover, Datahub also appears (on the face of it) to only store tables as a Dataset, and not the database itself. This is shown even in their demo page: [https://demo.datahubproject.io/browse/dataset/prod/postgres/calm-pagoda-323403/jaffle_shop] But some customers might want to store the Database also as a top-level entity. So consider enhancing DatahubSyncTool to do the same - probably using some advanced features of Datahub? Ongoing Slack thread about this in Datahub Slack: https://datahubspace.slack.com/archives/CUMUWQU66/p1665636994736379 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-5009) Enabling asynchronous processing of Metastore Sync
[ https://issues.apache.org/jira/browse/HUDI-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pramod Biligiri updated HUDI-5009: -- Summary: Enabling asynchronous processing of Metastore Sync (was: Enabling invoking runMetaSync() asynchronously) > Enabling asynchronous processing of Metastore Sync > -- > > Key: HUDI-5009 > URL: https://issues.apache.org/jira/browse/HUDI-5009 > Project: Apache Hudi > Issue Type: Task > Components: meta-sync >Reporter: Pramod Biligiri >Priority: Minor > > Currently, runMetaSync() invokes each Metastore Sync in a blocking fashion, > and iterates over the different Metastores sequentially - ([code > link|https://github.com/apache/hudi/blob/release-0.12.0/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L695] > within 0.12.0 branch) > And runMetaSync() is invoked during each commit, which can lead to a slow > down of commit flow if many metastores are being synced. So enable async > invocation of runMetaSync(). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-5009) Enabling invoking runMetaSync() asynchronously
Pramod Biligiri created HUDI-5009: - Summary: Enabling invoking runMetaSync() asynchronously Key: HUDI-5009 URL: https://issues.apache.org/jira/browse/HUDI-5009 Project: Apache Hudi Issue Type: Task Components: meta-sync Reporter: Pramod Biligiri Currently, runMetaSync() invokes each Metastore Sync in a blocking fashion, and iterates over the different Metastores sequentially - ([code link|https://github.com/apache/hudi/blob/release-0.12.0/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L695] within 0.12.0 branch) And runMetaSync() is invoked during each commit, which can lead to a slow down of commit flow if many metastores are being synced. So enable async invocation of runMetaSync(). -- This message was sent by Atlassian Jira (v8.20.10#820010)
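A minimal sketch of the asynchronous variant, using only the JDK. `SyncTool` and `runMetaSyncAsync` are hypothetical names introduced for illustration; they are not the DeltaSync API. Each metastore sync is submitted to an executor, and the caller gets back one future that completes once all of them have finished.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

final class AsyncMetaSyncSketch {

    // Hypothetical stand-in for one metastore sync implementation.
    interface SyncTool {
        void sync();
    }

    // Starts every sync on the executor and returns a future that completes
    // when all of them have finished (exceptionally if any of them failed).
    static CompletableFuture<Void> runMetaSyncAsync(List<SyncTool> tools, ExecutorService executor) {
        List<CompletableFuture<Void>> futures = tools.stream()
            .map(tool -> CompletableFuture.runAsync(tool::sync, executor))
            .collect(Collectors.toList());
        return CompletableFuture.allOf(futures.toArray(new CompletableFuture[0]));
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newFixedThreadPool(2);
        try {
            List<SyncTool> tools = List.of(
                () -> System.out.println("hive sync done"),
                () -> System.out.println("datahub sync done"));
            CompletableFuture<Void> pending = runMetaSyncAsync(tools, executor);
            // The commit flow is free to continue here while the syncs run.
            System.out.println("commit flow continues");
            pending.join(); // Wait before shutdown so no sync is lost.
        } finally {
            executor.shutdown();
        }
    }
}
```

Whether a commit should wait for the returned future, and how a failed sync is reported, are design decisions this sketch leaves open.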
[jira] [Updated] (HUDI-5009) Enabling invoking runMetaSync() asynchronously
[ https://issues.apache.org/jira/browse/HUDI-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pramod Biligiri updated HUDI-5009: -- Description: Currently, runMetaSync() invokes each Metastore Sync in a blocking fashion, and iterates over the different Metastores sequentially - ([code link|https://github.com/apache/hudi/blob/release-0.12.0/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L695] within 0.12.0 branch) And runMetaSync() is invoked during each commit, which can lead to a slow down of commit flow if many metastores are being synced. So enable async invocation of runMetaSync(). was: Currently, runMetaSync() invokes each Metastore Sync in a blocking fashion, and iterates over the different Metastores sequentially - ([code link|https://github.com/apache/hudi/blob/release-0.12.0/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L695] within 0.12.0 branch) And runMetaSync() is invoked during each commit, which can lead to a slow down of commit flow if many metastores are being synced. So enable async invocation of runMetaSync(). > Enabling invoking runMetaSync() asynchronously > -- > > Key: HUDI-5009 > URL: https://issues.apache.org/jira/browse/HUDI-5009 > Project: Apache Hudi > Issue Type: Task > Components: meta-sync >Reporter: Pramod Biligiri >Priority: Minor > > Currently, runMetaSync() invokes each Metastore Sync in a blocking fashion, > and iterates over the different Metastores sequentially - ([code > link|https://github.com/apache/hudi/blob/release-0.12.0/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L695] > within 0.12.0 branch) > And runMetaSync() is invoked during each commit, which can lead to a slow > down of commit flow if many metastores are being synced. So enable async > invocation of runMetaSync(). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4994) DatahubSyncTool does not correctly re-ingest soft-deleted entities
[ https://issues.apache.org/jira/browse/HUDI-4994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pramod Biligiri updated HUDI-4994:
Description:
Datahub has a notion of soft deletes (the entity still exists in the database with status=removed:true). Such an entity could get re-ingested with new properties at a later time, overwriting the older one. The current implementation in DatahubSyncTool does not handle this scenario: it fails to update the status flag to removed:false during ingest, which means the entity won't surface in the Datahub UI at all.
Ref: See sections on Soft Delete and Hard Delete in the Datahub docs: [https://datahubproject.io/docs/how/delete-metadata/#soft-delete-the-default]

was:
When DatahubSyncTool updates an entity in Datahub using an UPSERT request of their RestEmitter client, it can be assumed that the entity is no longer considered deleted, and needs to be discoverable henceforth in the Datahub UI.
For that, it is necessary to explicitly set the "status" metadata aspect of the entity to "{'removed':false}". This will handle the situation where the entity may have been (soft) deleted in the past. The addition of this "removed:false" for the "status" aspect has no impact on newly created entities, or hard-deleted entities (of which no trace remains anyway).
Ref: See sections on Soft Delete and Hard Delete in the Datahub docs: https://datahubproject.io/docs/how/delete-metadata/#soft-delete-the-default

Summary: DatahubSyncTool does not correctly re-ingest soft-deleted entities (was: DatahubSyncTool should set "removed" status of an entity to false when updating it)

> DatahubSyncTool does not correctly re-ingest soft-deleted entities
>
> Key: HUDI-4994
> URL: https://issues.apache.org/jira/browse/HUDI-4994
> Project: Apache Hudi
> Issue Type: Task
> Components: meta-sync
> Reporter: Pramod Biligiri
> Priority: Major
> Labels: pull-request-available
>
> Datahub has a notion of soft deletes (the entity still exists in the database with status=removed:true). Such an entity could get re-ingested with new properties at a later time, overwriting the older one. The current implementation in DatahubSyncTool does not handle this scenario: it fails to update the status flag to removed:false during ingest, which means the entity won't surface in the Datahub UI at all.
> Ref: See sections on Soft Delete and Hard Delete in the Datahub docs:
> [https://datahubproject.io/docs/how/delete-metadata/#soft-delete-the-default]

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4994) DatahubSyncTool should set "removed" status of an entity to false when updating it
[ https://issues.apache.org/jira/browse/HUDI-4994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pramod Biligiri updated HUDI-4994:
Description:
When DatahubSyncTool updates an entity in Datahub using an UPSERT request of their RestEmitter client, it can be assumed that the entity is no longer considered deleted, and needs to be discoverable henceforth in the Datahub UI.
For that, it is necessary to explicitly set the "status" metadata aspect of the entity to "{'removed':false}". This will handle the situation where the entity may have been (soft) deleted in the past. The addition of this "removed:false" for the "status" aspect has no impact on newly created entities, or hard-deleted entities (of which no trace remains anyway).
Ref: See sections on Soft Delete and Hard Delete in the Datahub docs: https://datahubproject.io/docs/how/delete-metadata/#soft-delete-the-default

was:
When DatahubSyncTool updates an entity in Datahub using an UPSERT request of their RestEmitter client, it can be assumed that the entity is no longer considered deleted, and needs to be discoverable henceforth in the Datahub UI.
For that, it is necessary to explicitly set the "status" metadata aspect of the entity to "{'removed':false}". This will handle the situation where the entity may have been (soft) deleted in the past. The addition of this "removed:false" for the "status" aspect has no impact on newly created entities, or hard-deleted entities (of which no trace remains anyway).

> DatahubSyncTool should set "removed" status of an entity to false when updating it
>
> Key: HUDI-4994
> URL: https://issues.apache.org/jira/browse/HUDI-4994
> Project: Apache Hudi
> Issue Type: Task
> Components: meta-sync
> Reporter: Pramod Biligiri
> Priority: Major
> Labels: pull-request-available
>
> When DatahubSyncTool updates an entity in Datahub using an UPSERT request of their RestEmitter client, it can be assumed that the entity is no longer considered deleted, and needs to be discoverable henceforth in the Datahub UI.
> For that, it is necessary to explicitly set the "status" metadata aspect of the entity to "{'removed':false}". This will handle the situation where the entity may have been (soft) deleted in the past. The addition of this "removed:false" for the "status" aspect has no impact on newly created entities, or hard-deleted entities (of which no trace remains anyway).
> Ref: See sections on Soft Delete and Hard Delete in the Datahub docs:
> https://datahubproject.io/docs/how/delete-metadata/#soft-delete-the-default

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4993) Allow specifying custom DataPlatform name and Dataset env in DatahubSyncTool
[ https://issues.apache.org/jira/browse/HUDI-4993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pramod Biligiri updated HUDI-4993:
Component/s: meta-sync

> Allow specifying custom DataPlatform name and Dataset env in DatahubSyncTool
>
> Key: HUDI-4993
> URL: https://issues.apache.org/jira/browse/HUDI-4993
> Project: Apache Hudi
> Issue Type: Task
> Components: meta-sync
> Reporter: Pramod Biligiri
> Priority: Major
>
> The name of the Datahub DataPlatform to use and the environment of the Datahub Dataset (DEV, PROD, etc.) are currently hardcoded inside HoodieDataHubDatasetIdentifier -
> [https://github.com/apache/hudi/blob/release-0.12.0/hudi-sync/hudi-datahub-sync/src/main/java/org/apache/hudi/sync/datahub/config/HoodieDataHubDatasetIdentifier.java#L47-L49]
> Allow for these two to be customized.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-4994) DatahubSyncTool should set "removed" status of an entity to false when updating it
Pramod Biligiri created HUDI-4994:

Summary: DatahubSyncTool should set "removed" status of an entity to false when updating it
Key: HUDI-4994
URL: https://issues.apache.org/jira/browse/HUDI-4994
Project: Apache Hudi
Issue Type: Task
Components: meta-sync
Reporter: Pramod Biligiri

When DatahubSyncTool updates an entity in Datahub using an UPSERT request of their RestEmitter client, it can be assumed that the entity is no longer considered deleted, and needs to be discoverable henceforth in the Datahub UI.
For that, it is necessary to explicitly set the "status" metadata aspect of the entity to "{'removed':false}". This will handle the situation where the entity may have been (soft) deleted in the past. The addition of this "removed:false" for the "status" aspect has no impact on newly created entities, or hard-deleted entities (of which no trace remains anyway).

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-4993) Allow specifying custom DataPlatform name and Dataset env in DatahubSyncTool
Pramod Biligiri created HUDI-4993:

Summary: Allow specifying custom DataPlatform name and Dataset env in DatahubSyncTool
Key: HUDI-4993
URL: https://issues.apache.org/jira/browse/HUDI-4993
Project: Apache Hudi
Issue Type: Task
Reporter: Pramod Biligiri

The name of the Datahub DataPlatform to use and the environment of the Datahub Dataset (DEV, PROD, etc.) are currently hardcoded inside HoodieDataHubDatasetIdentifier -
[https://github.com/apache/hudi/blob/release-0.12.0/hudi-sync/hudi-datahub-sync/src/main/java/org/apache/hudi/sync/datahub/config/HoodieDataHubDatasetIdentifier.java#L47-L49]
Allow for these two to be customized.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
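The customization could look roughly like the sketch below, which reads both values from properties and falls back to defaults. Everything named here is an assumption for illustration: the property keys are invented, the defaults ("hudi" and "DEV") are placeholders that may not match the hardcoded values, and the class is not Hudi code. The URN layout follows Datahub's dataset URN convention of platform, name, and environment.

```java
import java.util.Properties;

final class DatasetUrnSketch {

    // Hypothetical property keys; not confirmed Hudi config names.
    static final String PLATFORM_KEY = "example.datahub.platform.name";
    static final String ENV_KEY = "example.datahub.dataset.env";

    // Builds a Datahub dataset URN, taking platform and env from props when present.
    static String datasetUrn(Properties props, String database, String table) {
        String platform = props.getProperty(PLATFORM_KEY, "hudi"); // assumed default
        String env = props.getProperty(ENV_KEY, "DEV");            // assumed default
        return String.format(
            "urn:li:dataset:(urn:li:dataPlatform:%s,%s.%s,%s)", platform, database, table, env);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(ENV_KEY, "PROD");
        System.out.println(datasetUrn(props, "default", "gcs_meta_hive_4"));
    }
}
```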
[jira] [Commented] (HUDI-4931) Explore fat jar option for gcs-connector lib used during GCS Ingestion
[ https://issues.apache.org/jira/browse/HUDI-4931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610422#comment-17610422 ]

Pramod Biligiri commented on HUDI-4931:

Some useful references regarding this:
- GCP docs on Cloud Storage connector: [https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage]
- Hudi docs on GCS connectivity: https://hudi.apache.org/docs/gcs_hoodie/

> Explore fat jar option for gcs-connector lib used during GCS Ingestion
>
> Key: HUDI-4931
> URL: https://issues.apache.org/jira/browse/HUDI-4931
> Project: Apache Hudi
> Issue Type: Task
> Reporter: Pramod Biligiri
> Priority: Major
>
> Currently, the GCS Ingestion (HUDI-4850) expects recent versions of Jars like protobuf and Guava to be provided to spark-submit explicitly, to override older versions shipped with Spark. These Jars are used by the gcs-connector which is a library from Google that helps connect to GCS. For more details see
> [https://docs.google.com/document/d/1VfvtdvhXw6oEHPgZ_4Be2rkPxIzE0kBCNUiVDsXnSAA/edit#]
> (section titled "Configure Spark to use newer versions of some Jars").
> See if it's possible to create a shaded+fat jar of gcs-connector for this use case instead, and avoid specifying things to spark-submit on the command line.
> An alternate approach to consider for the long term is HUDI-4930 (slim bundles).

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-4931) Explore fat jar option for gcs-connector lib used during GCS Ingestion
Pramod Biligiri created HUDI-4931: - Summary: Explore fat jar option for gcs-connector lib used during GCS Ingestion Key: HUDI-4931 URL: https://issues.apache.org/jira/browse/HUDI-4931 Project: Apache Hudi Issue Type: Task Reporter: Pramod Biligiri Currently, the GCS Ingestion (HUDI-4850) expects recent versions of Jars like protobuf and Guava to be provided to spark-submit explicitly, to override older versions shipped with Spark. These Jars are used by the gcs-connector which is a library from Google that helps connect to GCS. For more details see [https://docs.google.com/document/d/1VfvtdvhXw6oEHPgZ_4Be2rkPxIzE0kBCNUiVDsXnSAA/edit#] (section titled "Configure Spark to use newer versions of some Jars"). See if it's possible to create a shaded+fat jar of gcs-connector for this use case instead, and avoid specifying things to spark-submit on the command line. An alternate approach to consider for the long term is HUDI-4930 (slim bundles). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-4930) Create a bundle with all GCS deps that works with utilities-slim and engine bundle (spark or flink)
Pramod Biligiri created HUDI-4930: - Summary: Create a bundle with all GCS deps that works with utilities-slim and engine bundle (spark or flink) Key: HUDI-4930 URL: https://issues.apache.org/jira/browse/HUDI-4930 Project: Apache Hudi Issue Type: Task Reporter: Pramod Biligiri Currently, GCS deps are explicitly invoked within hudi-utilities POM and when invoking GCS Ingestion (a fat jar is not used). Instead, create a bundle with all GCS deps that works with utilities-slim and engine bundle (spark or flink) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-4929) Refactor code that is common to all ingestions from cloud sources
Pramod Biligiri created HUDI-4929: - Summary: Refactor code that is common to all ingestions from cloud sources Key: HUDI-4929 URL: https://issues.apache.org/jira/browse/HUDI-4929 Project: Apache Hudi Issue Type: Task Reporter: Pramod Biligiri Currently, there are features to ingest incrementally from S3 (HUDI-1897) and GCS (HUDI-4850). Refactor common logic used across both. This will help in easier implementation of future cloud based ingestion sources. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-4928) Use common configs for ingestion from S3, GCS etc
Pramod Biligiri created HUDI-4928: - Summary: Use common configs for ingestion from S3, GCS etc Key: HUDI-4928 URL: https://issues.apache.org/jira/browse/HUDI-4928 Project: Apache Hudi Issue Type: Task Reporter: Pramod Biligiri Currently, incremental ingestion is supported from S3 (HUDI-1897) and GCS (HUDI-4850). Normalize the config params that are common to both. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-4927) GCS Ingestion supports only new file uploads, no deletion and repeated uploads
Pramod Biligiri created HUDI-4927: - Summary: GCS Ingestion supports only new file uploads, no deletion and repeated uploads Key: HUDI-4927 URL: https://issues.apache.org/jira/browse/HUDI-4927 Project: Apache Hudi Issue Type: Bug Reporter: Pramod Biligiri The GCS Ingestion (https://issues.apache.org/jira/browse/HUDI-4850) supports only events related to new files which are being uploaded for the first time. Specifically, it does not detect files being deleted, or the same file being uploaded repeatedly. GCS even has a notion of Object Versioning, which is also not supported. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-4850) Implement DeltaStreamer Source for Google Cloud Storage
Pramod Biligiri created HUDI-4850: - Summary: Implement DeltaStreamer Source for Google Cloud Storage Key: HUDI-4850 URL: https://issues.apache.org/jira/browse/HUDI-4850 Project: Apache Hudi Issue Type: Task Components: deltastreamer Reporter: Pramod Biligiri Fix For: 0.13.0 It should be possible to reliably ingest data from GCS buckets into Hudi using a Deltastreamer Source. Such a feature already exists to ingest from AWS S3 buckets, as discussed in HUDI-1897 and described in a Hudi blog post: https://hudi.apache.org/blog/2021/08/23/s3-events-source/ -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-4819) run_sync_tool.sh in hudi-hive-sync fails with classpath errors on release-0.12.0
Pramod Biligiri created HUDI-4819:

Summary: run_sync_tool.sh in hudi-hive-sync fails with classpath errors on release-0.12.0
Key: HUDI-4819
URL: https://issues.apache.org/jira/browse/HUDI-4819
Project: Apache Hudi
Issue Type: Bug
Components: hive, meta-sync
Affects Versions: 0.12.0
Reporter: Pramod Biligiri
Attachments: modified_run_sync_tool.sh

I ran the run_sync_tool.sh script after git cloning and building a new instance of apache-hudi (branch: release-0.12.0). The script failed with classpath related errors. Find below the relevant sequence of commands I used:

$ git branch
* (HEAD detached at release-0.12.0)
$ mvn -Dspark3.2 -Dscala-2.12 -DskipTests -Dcheckstyle.skip -Drat.skip clean install
$ echo $HADOOP_HOME
/home/pramod/2installers/hadoop-2.7.4
$ echo $HIVE_HOME
/home/pramod/2installers/apache-hive-3.1.3-bin
$ /run_sync_tool.sh --jdbc-url jdbc:hive2://hiveserver:1 --partitioned-by bucket --base-path /2-pramod/tmp/gcs-integration-test/data/meta-gcs --database default --table gcs_meta_hive_4 > log.out 2>&1
setting hadoop conf dir
Running Command : java -cp
/home/pramod/2installers/apache-hive-3.1.3-bin/lib/hive-metastore-3.1.3.jar::/home/pramod/2installers/apache-hive-3.1.3-bin/lib/hive-service-3.1.3.jar::/home/pramod/2installers/apache-hive-3.1.3-bin/lib/hive-exec-3.1.3.jar::/home/pramod/2installers/apache-hive-3.1.3-bin/lib/hive-jdbc-3.1.3.jar:/home/pramod/2installers/apache-hive-3.1.3-bin/lib/hive-jdbc-handler-3.1.3.jar::/home/pramod/2installers/apache-hive-3.1.3-bin/lib/jackson-annotations-2.12.0.jar:/home/pramod/2installers/apache-hive-3.1.3-bin/lib/jackson-core-2.12.0.jar:/home/pramod/2installers/apache-hive-3.1.3-bin/lib/jackson-core-asl-1.9.13.jar:/home/pramod/2installers/apache-hive-3.1.3-bin/lib/jackson-databind-2.12.0.jar:/home/pramod/2installers/apache-hive-3.1.3-bin/lib/jackson-dataformat-smile-2.12.0.jar:/home/pramod/2installers/apache-hive-3.1.3-bin/lib/jackson-mapper-asl-1.9.13.jar:/home/pramod/2installers/apache-hive-3.1.3-bin/lib/jackson-module-scala_2.11-2.12.0.jar::/home/pramod/2installers/hadoop-2.7.4/share/hadoop/common/*:/home/pramod/2installers/hadoop-2.7.4/share/hadoop/mapreduce/*:/home/pramod/2installers/hadoop-2.7.4/share/hadoop/hdfs/*:/home/pramod/2installers/hadoop-2.7.4/share/hadoop/common/lib/*:/home/pramod/2installers/hadoop-2.7.4/share/hadoop/hdfs/lib/*:/home/pramod/2installers/hadoop-2.7.4/etc/hadoop:/3-pramod/3workspace/apache-hudi/hudi-sync/hudi-hive-sync/../../packaging/hudi-hive-sync-bundle/target/hudi-hive-sync-bundle-0.12.0.jar org.apache.hudi.hive.HiveSyncTool --jdbc-url jdbc:hive2://hiveserver:1 --partitioned-by bucket --base-path /2-pramod/tmp/gcs-integration-test/data/meta-gcs --database default --table gcs_meta_hive_4
2022-09-08 10:53:24,335 INFO [main] conf.HiveConf (HiveConf.java:findConfigFile(187)) - Found configuration file file:/home/pramod/2installers/apache-hive-3.1.3-bin/conf/hive-site.xml
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil
(file:/2-pramod/installers/hadoop-2.7.4/share/hadoop/common/lib/hadoop-auth-2.7.4.jar) to method sun.security.krb5.Config.getInstance()
WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
2022-09-08 10:53:25,876 WARN [main] util.NativeCodeLoader (NativeCodeLoader.java:(62)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2022-09-08 10:53:26,359 INFO [main] table.HoodieTableMetaClient (HoodieTableMetaClient.java:(121)) - Loading HoodieTableMetaClient from /2-pramod/tmp/gcs-integration-test/data/meta-gcs
2022-09-08 10:53:26,568 INFO [main] table.HoodieTableConfig (HoodieTableConfig.java:(243)) - Loading table properties from /2-pramod/tmp/gcs-integration-test/data/meta-gcs/.hoodie/hoodie.properties
2022-09-08 10:53:26,585 INFO [main] table.HoodieTableMetaClient (HoodieTableMetaClient.java:(140)) - Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from /2-pramod/tmp/gcs-integration-test/data/meta-gcs
2022-09-08 10:53:26,586 INFO [main] table.HoodieTableMetaClient (HoodieTableMetaClient.java:(143)) - Loading Active commit timeline for /2-pramod/tmp/gcs-integration-test/data/meta-gcs
2022-09-08 10:53:26,727 INFO [main] timeline.HoodieActiveTimeline (HoodieActiveTimeline.java:(129)) - Loaded instants upto : Option{val=[20220907220948700__commit__COMPLETED]}
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/http/config/Lookup at org.apache.hive.jdbc.HiveDriver.conn