[jira] [Created] (HUDI-5826) Add docs for how to use Hudi CLI on GCP

2023-02-21 Thread Pramod Biligiri (Jira)
Pramod Biligiri created HUDI-5826:
-

 Summary: Add docs for how to use Hudi CLI on GCP
 Key: HUDI-5826
 URL: https://issues.apache.org/jira/browse/HUDI-5826
 Project: Apache Hudi
  Issue Type: Improvement
  Components: docs
Reporter: Pramod Biligiri


If a user wants to set up and run Hudi CLI on a GCP Dataproc node, there is 
currently no clear documentation for doing so.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5826) Add docs for how to use Hudi CLI on GCP Dataproc

2023-02-21 Thread Pramod Biligiri (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pramod Biligiri updated HUDI-5826:
--
Summary: Add docs for how to use Hudi CLI on GCP Dataproc  (was: Add docs 
for how to use Hudi CLI on GCP)

> Add docs for how to use Hudi CLI on GCP Dataproc
> 
>
> Key: HUDI-5826
> URL: https://issues.apache.org/jira/browse/HUDI-5826
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: docs
>Reporter: Pramod Biligiri
>Priority: Major
>  Labels: documentation
>
> If a user wants to set up and run Hudi CLI on a GCP Dataproc node, currently 
> there is no clear documentation for the same.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-5806) hudi-cli should have option to show nearest matching commit

2023-02-15 Thread Pramod Biligiri (Jira)
Pramod Biligiri created HUDI-5806:
-

 Summary: hudi-cli should have option to show nearest matching 
commit
 Key: HUDI-5806
 URL: https://issues.apache.org/jira/browse/HUDI-5806
 Project: Apache Hudi
  Issue Type: Improvement
  Components: cli
Reporter: Pramod Biligiri


When searching for a commit timestamp in hudi-cli, there should be an option to 
display the nearest matching commits if no exact match is found. This will help 
in production support use cases, where the user needs to quickly see the recent 
commit activity in the time period of interest.
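
For illustration, a minimal sketch of such a lookup, assuming the active
timeline's commit times are available as sorted strings (the class and method
names here are made up for this sketch, not Hudi APIs):

    import java.util.Arrays;
    import java.util.TreeSet;

    class NearestCommitLookup {
        // Latest commit at or before the requested time, and the earliest
        // commit at or after it; either may be null at the timeline edges.
        static String[] nearest(TreeSet<String> commitTimes, String requested) {
            return new String[] {
                commitTimes.floor(requested), commitTimes.ceiling(requested)
            };
        }

        public static void main(String[] args) {
            TreeSet<String> times = new TreeSet<>(
                Arrays.asList("20230210101500000", "20230211093000000"));
            // Prints [20230210101500000, 20230211093000000]
            System.out.println(Arrays.toString(nearest(times, "20230210120000000")));
        }
    }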



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5804) hudi-cli CommitsCommand - some options fail due to typo in ShellOption annotation

2023-02-15 Thread Pramod Biligiri (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pramod Biligiri updated HUDI-5804:
--
Description: 
In multiple places in CommitsCommand, the @ShellOption annotation is missing the 
"--" prefix in its value attribute. One such example is shown below from "commit 
showpartitions":

[https://github.com/apache/hudi/blob/master/hudi-cli/src/main/java/org/apache/hudi/cli/commands/CommitsCommand.java#L213]
@ShellOption(value = {"includeArchivedTimeline"}, help = "Include archived 
commits as well", defaultValue = "false") final boolean 
includeArchivedTimeline)

In the above, it should read value = {"--includeArchivedTimeline", ...}
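
A corrected form of that parameter might look like the following (a sketch 
based directly on the snippet above; the surrounding method signature is 
elided):

    // Corrected: the option name carries the "--" prefix in the value attribute.
    @ShellOption(value = {"--includeArchivedTimeline"},
            help = "Include archived commits as well",
            defaultValue = "false") final boolean includeArchivedTimeline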

 

 

  was:
In multiple places in CommitsCommand, the @ShellOption annotation is missing the 
"--" prefix in its value attribute. One such example is shown below from "commit 
showpartitions":

[https://github.com/apache/hudi/blob/master/hudi-cli/src/main/java/org/apache/hudi/cli/commands/CommitsCommand.java#L213]
@ShellOption(value = {"includeArchivedTimeline"}, help = "Include archived 
commits as well", defaultValue = "false") final boolean 
includeArchivedTimeline)

That should read value = {"--includeArchivedTimeline"}

 

 


> hudi-cli CommitsCommand - some options fail due to typo in ShellOption 
> annotation
> -
>
> Key: HUDI-5804
> URL: https://issues.apache.org/jira/browse/HUDI-5804
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: cli
>Reporter: Pramod Biligiri
>Priority: Minor
>
> In multiple places in CommitsCommand, the @ShellOption annotation is missing 
> the "--" prefix in its value attribute. One such example is shown below from 
> "commit showpartitions":
> [https://github.com/apache/hudi/blob/master/hudi-cli/src/main/java/org/apache/hudi/cli/commands/CommitsCommand.java#L213]
> @ShellOption(value = {"includeArchivedTimeline"}, help = "Include archived 
> commits as well", defaultValue = "false") final boolean 
> includeArchivedTimeline)
> In the above, it should read value = {"--includeArchivedTimeline", ...}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-5804) hudi-cli CommitsCommand - some options fail due to typo in ShellOption annotation

2023-02-15 Thread Pramod Biligiri (Jira)
Pramod Biligiri created HUDI-5804:
-

 Summary: hudi-cli CommitsCommand - some options fail due to typo 
in ShellOption annotation
 Key: HUDI-5804
 URL: https://issues.apache.org/jira/browse/HUDI-5804
 Project: Apache Hudi
  Issue Type: Bug
  Components: cli
Reporter: Pramod Biligiri


In multiple places in CommitsCommand, the @ShellOption annotation is missing the 
"--" prefix in its value attribute. One such example is shown below from "commit 
showpartitions":

[https://github.com/apache/hudi/blob/master/hudi-cli/src/main/java/org/apache/hudi/cli/commands/CommitsCommand.java#L213]
@ShellOption(value = {"includeArchivedTimeline"}, help = "Include archived 
commits as well", defaultValue = "false") final boolean 
includeArchivedTimeline)

That should read value = {"--includeArchivedTimeline"}

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-5719) Add docs for hudi-cli "show restores" feature

2023-02-07 Thread Pramod Biligiri (Jira)
Pramod Biligiri created HUDI-5719:
-

 Summary: Add docs for hudi-cli "show restores" feature
 Key: HUDI-5719
 URL: https://issues.apache.org/jira/browse/HUDI-5719
 Project: Apache Hudi
  Issue Type: Task
  Components: docs
Reporter: Pramod Biligiri


Once the hudi-cli "show restores" feature is accepted 
(https://issues.apache.org/jira/browse/HUDI-1593 is considered done), add 
documentation for the same to the website and wherever else required.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-5688) schema field of EmptyRelation subtype of BaseRelation should not be null

2023-02-05 Thread Pramod Biligiri (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17684486#comment-17684486
 ] 

Pramod Biligiri commented on HUDI-5688:
---

A small workaround for the null value, which shows that the bug diagnosis is 
valid: [https://github.com/apache/hudi/pull/7864]

Not sure if the above change can be considered a fix to the issue.

> schema field of EmptyRelation subtype of BaseRelation should not be null
> 
>
> Key: HUDI-5688
> URL: https://issues.apache.org/jira/browse/HUDI-5688
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: core
>Reporter: Pramod Biligiri
>Priority: Major
>  Labels: pull-request-available
> Attachments: 1-userSpecifiedSchema-is-null.png, 2-empty-relation.png, 
> 3-table-schema-will-not-resolve.png, 4-resolve-schema-returns-null.png, 
> Main.java, pom.xml
>
>
> If there are no completed instants in the table, and there is no user defined 
> schema for it as well (as represented by the userSpecifiedSchema field in 
> DataSource.scala), then the EmptyRelation returned by 
> DefaultSource.createRelation sets schema of the EmptyRelation to null. This 
> breaks the contract of Spark's BaseRelation, where the schema is a StructType 
> but is not expected to be null.
> Module versions: current apache-hudi master (commit hash 
> abe26d4169c04da05b99941161621876e3569e96) built with spark3.2 and scala-2.12.
> The following Spark session reproduces the above issue:
> spark.read.format("hudi")
>   .option("hoodie.datasource.query.type", "incremental")
>   .load("SOME_HUDI_TABLE_WITH_NO_COMPLETED_INSTANTS_OR_USER_SPECIFIED_SCHEMA")
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.util.CharVarcharUtils$.replaceCharVarcharWithStringInSchema(CharVarcharUtils.scala:41)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation$.apply(LogicalRelation.scala:76)
>   at 
> org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:440)
>   at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274)
>   at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:188)
>   ... 50 elided  
> Find attached a few screenshots which show the code flow and the buggy state 
> of the variables. Also find attached a Java file and pom.xml that can be used 
> to reproduce the same (sorry, don't have a deanonymized table to share yet).
> The bug seems to have been introduced in this particular PR change: 
> [https://github.com/apache/hudi/pull/6727/files#diff-4cfd87bb9200170194a633746094de138c3a0e3976d351d0d911ee95651256acR220]
> Initial work on that file has happened in this particular Jira 
> (https://issues.apache.org/jira/browse/HUDI-4363) and PR 
> (https://github.com/apache/hudi/pull/6046) respectively.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-5688) schema field of EmptyRelation subtype of BaseRelation should not be null

2023-02-02 Thread Pramod Biligiri (Jira)
Pramod Biligiri created HUDI-5688:
-

 Summary: schema field of EmptyRelation subtype of BaseRelation 
should not be null
 Key: HUDI-5688
 URL: https://issues.apache.org/jira/browse/HUDI-5688
 Project: Apache Hudi
  Issue Type: Bug
  Components: core
Reporter: Pramod Biligiri
 Attachments: 1-userSpecifiedSchema-is-null.png, 2-empty-relation.png, 
3-table-schema-will-not-resolve.png, 4-resolve-schema-returns-null.png, 
Main.java, pom.xml

If there are no completed instants in the table, and there is no user defined 
schema for it as well (as represented by the userSpecifiedSchema field in 
DataSource.scala), then the EmptyRelation returned by 
DefaultSource.createRelation sets schema of the EmptyRelation to null. This 
breaks the contract of Spark's BaseRelation, where the schema is a StructType 
but is not expected to be null.

Module versions: current apache-hudi master (commit hash 
abe26d4169c04da05b99941161621876e3569e96) built with spark3.2 and scala-2.12.

The following Spark session reproduces the above issue:

spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .load("SOME_HUDI_TABLE_WITH_NO_COMPLETED_INSTANTS_OR_USER_SPECIFIED_SCHEMA")

java.lang.NullPointerException
  at 
org.apache.spark.sql.catalyst.util.CharVarcharUtils$.replaceCharVarcharWithStringInSchema(CharVarcharUtils.scala:41)
  at 
org.apache.spark.sql.execution.datasources.LogicalRelation$.apply(LogicalRelation.scala:76)
  at 
org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:440)
  at 
org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274)
  at 
org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:188)
  ... 50 elided  

Find attached a few screenshots which show the code flow and the buggy state of 
the variables. Also find attached a Java file and pom.xml that can be used to 
reproduce the same (sorry, don't have a deanonymized table to share yet).

The bug seems to have been introduced in this particular PR change: 
[https://github.com/apache/hudi/pull/6727/files#diff-4cfd87bb9200170194a633746094de138c3a0e3976d351d0d911ee95651256acR220]

Initial work on that file has happened in this particular Jira 
(https://issues.apache.org/jira/browse/HUDI-4363) and PR 
(https://github.com/apache/hudi/pull/6046) respectively.
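
As a hedged illustration of a possible caller-side workaround (not a fix): 
supplying an explicit read schema should populate userSpecifiedSchema in 
DataSource.scala, so the empty-relation path never has to resolve a schema. 
The schema fields and table path below are placeholders:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.StructType;

    public class IncrementalReadWorkaround {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("hudi-5688-workaround").master("local[*]").getOrCreate();
            // Placeholder schema: use the table's real columns here.
            StructType schema = new StructType().add("id", "string").add("ts", "long");
            Dataset<Row> df = spark.read().format("hudi")
                    .option("hoodie.datasource.query.type", "incremental")
                    .schema(schema) // makes userSpecifiedSchema non-empty
                    .load("/path/to/table/with/no/completed/instants");
            df.printSchema();
        }
    }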



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-5650) Add source data estimators to optimize ingestion runs

2023-01-30 Thread Pramod Biligiri (Jira)
Pramod Biligiri created HUDI-5650:
-

 Summary: Add source data estimators to optimize ingestion runs
 Key: HUDI-5650
 URL: https://issues.apache.org/jira/browse/HUDI-5650
 Project: Apache Hudi
  Issue Type: New Feature
  Components: deltastreamer
Reporter: Pramod Biligiri


Estimate how much new data is available to be ingested from a given data 
source, and schedule DeltaStreamer jobs based on that estimate.
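
For illustration, one shape such an estimator could take (a hypothetical 
sketch; neither this interface nor its names exist in Hudi):

    // A scheduler could skip or right-size a DeltaStreamer run based on this.
    public interface SourceDataEstimator {
        /** Estimated bytes of new data available since the given checkpoint. */
        long estimateNewBytes(String lastCheckpoint);
    }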



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-5024) Support storing database also as a Dataset in Datahub, not just a table

2022-10-12 Thread Pramod Biligiri (Jira)
Pramod Biligiri created HUDI-5024:
-

 Summary: Support storing database also as a Dataset in Datahub, 
not just a table
 Key: HUDI-5024
 URL: https://issues.apache.org/jira/browse/HUDI-5024
 Project: Apache Hudi
  Issue Type: Task
  Components: meta-sync
Reporter: Pramod Biligiri


Note: Evaluate feasibility and desirability of this before implementing.

Hudi's DatahubSyncTool only pushes tables as Datasets into Datahub, not the 
database itself as a Dataset. Moreover, Datahub also appears (on the face of 
it) to store only tables as Datasets, and not the database itself. This is 
shown even on their demo page: 
[https://demo.datahubproject.io/browse/dataset/prod/postgres/calm-pagoda-323403/jaffle_shop]

But some customers might want to store the database as a top-level entity as 
well. So consider enhancing DatahubSyncTool to do the same, probably using 
some advanced features of Datahub.

Ongoing Slack thread about this in Datahub Slack: 
https://datahubspace.slack.com/archives/CUMUWQU66/p1665636994736379



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5009) Enabling asynchronous processing of Metastore Sync

2022-10-11 Thread Pramod Biligiri (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pramod Biligiri updated HUDI-5009:
--
Summary: Enabling asynchronous processing of Metastore Sync  (was: Enabling 
invoking runMetaSync() asynchronously)

> Enabling asynchronous processing of Metastore Sync
> --
>
> Key: HUDI-5009
> URL: https://issues.apache.org/jira/browse/HUDI-5009
> Project: Apache Hudi
>  Issue Type: Task
>  Components: meta-sync
>Reporter: Pramod Biligiri
>Priority: Minor
>
> Currently, runMetaSync() invokes each Metastore Sync in a blocking fashion, 
> and iterates over the different Metastores sequentially - ([code 
> link|https://github.com/apache/hudi/blob/release-0.12.0/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L695]
>  within the 0.12.0 branch)
> And runMetaSync() is invoked during each commit, which can slow down the 
> commit flow if many metastores are being synced. So enable async invocation 
> of runMetaSync().



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-5009) Enabling invoking runMetaSync() asynchronously

2022-10-11 Thread Pramod Biligiri (Jira)
Pramod Biligiri created HUDI-5009:
-

 Summary: Enabling invoking runMetaSync() asynchronously
 Key: HUDI-5009
 URL: https://issues.apache.org/jira/browse/HUDI-5009
 Project: Apache Hudi
  Issue Type: Task
  Components: meta-sync
Reporter: Pramod Biligiri


Currently, runMetaSync() invokes each Metastore Sync in a blocking fashion, and 
iterates over the different Metastores sequentially - 
([code 
link|https://github.com/apache/hudi/blob/release-0.12.0/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L695]
 within 0.12.0 branch)

And runMetaSync() is invoked during each commit, which can slow down the commit 
flow if many metastores are being synced. So enable async invocation of 
runMetaSync().
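
For illustration, a minimal sketch of the async variant, assuming each 
metastore sync is wrapped as a Runnable (the names here are illustrative, not 
the actual DeltaSync code):

    import java.util.List;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    class AsyncMetaSync {
        // Runs all metastore syncs concurrently instead of one after another.
        static void runMetaSyncAsync(List<Runnable> syncs) {
            ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, syncs.size()));
            CompletableFuture<?>[] futures = syncs.stream()
                    .map(s -> CompletableFuture.runAsync(s, pool))
                    .toArray(CompletableFuture[]::new);
            // join() keeps failure visibility; drop it for fire-and-forget
            // so the commit path never waits on slow metastores.
            CompletableFuture.allOf(futures).join();
            pool.shutdown();
        }
    }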



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5009) Enabling invoking runMetaSync() asynchronously

2022-10-11 Thread Pramod Biligiri (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pramod Biligiri updated HUDI-5009:
--
Description: 
Currently, runMetaSync() invokes each Metastore Sync in a blocking fashion, and 
iterates over the different Metastores sequentially - ([code 
link|https://github.com/apache/hudi/blob/release-0.12.0/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L695]
 within 0.12.0 branch)

And runMetaSync() is invoked during each commit, which can slow down the commit 
flow if many metastores are being synced. So enable async invocation of 
runMetaSync().

  was:
Currently, runMetaSync() invokes each Metastore Sync in a blocking fashion, and 
iterates over the different Metastores sequentially - 
([code 
link|https://github.com/apache/hudi/blob/release-0.12.0/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L695]
 within 0.12.0 branch)

And runMetaSync() is invoked during each commit, which can slow down the commit 
flow if many metastores are being synced. So enable async invocation of 
runMetaSync().


> Enabling invoking runMetaSync() asynchronously
> --
>
> Key: HUDI-5009
> URL: https://issues.apache.org/jira/browse/HUDI-5009
> Project: Apache Hudi
>  Issue Type: Task
>  Components: meta-sync
>Reporter: Pramod Biligiri
>Priority: Minor
>
> Currently, runMetaSync() invokes each Metastore Sync in a blocking fashion, 
> and iterates over the different Metastores sequentially - ([code 
> link|https://github.com/apache/hudi/blob/release-0.12.0/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L695]
>  within 0.12.0 branch)
> And runMetaSync() is invoked during each commit, which can slow down the 
> commit flow if many metastores are being synced. So enable async invocation 
> of runMetaSync().



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4994) DatahubSyncTool does not correctly re-ingest soft-deleted entities

2022-10-07 Thread Pramod Biligiri (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pramod Biligiri updated HUDI-4994:
--
Description: 
Datahub has a notion of soft-deletes (the entity still exists in the database 
with a status=removed:true). Such entities could get re-ingested with new 
properties at a later time, such that the older one gets overwritten. The 
current implementation in DatahubSyncTool does not handle this scenario. It 
fails to update the status flag to removed:false during ingest, which means the 
entity won't surface in the Datahub UI at all.

Ref: See sections on Soft Delete and Hard Delete in the Datahub docs: 
[https://datahubproject.io/docs/how/delete-metadata/#soft-delete-the-default]

  was:
When DatahubSyncTool updates an entity in Datahub using an UPSERT request of 
their RestEmitter client, it can be assumed that the entity is no longer 
considered deleted, and needs to be discoverable henceforth in the Datahub UI.

For that, it is necessary to explicitly set the "status" metadata aspect of the 
entity to "{'removed':false}". This will handle the situation where the entity 
may have been (soft) deleted in the past. The addition of this "removed:false" 
for "status" aspect has no impact on newly created entities, or hard-deleted 
entities (of which no trace remains anyway).

Ref: See sections on Soft Delete and Hard Delete in the Datahub docs: 
https://datahubproject.io/docs/how/delete-metadata/#soft-delete-the-default

Summary: DatahubSyncTool does not correctly re-ingest soft-deleted 
entities  (was: DatahubSyncTool should set "removed" status of an entity to 
false when updating it)

> DatahubSyncTool does not correctly re-ingest soft-deleted entities
> --
>
> Key: HUDI-4994
> URL: https://issues.apache.org/jira/browse/HUDI-4994
> Project: Apache Hudi
>  Issue Type: Task
>  Components: meta-sync
>Reporter: Pramod Biligiri
>Priority: Major
>  Labels: pull-request-available
>
> Datahub has a notion of soft-deletes (the entity still exists in the database 
> with a status=removed:true). Such entities could get re-ingested with new 
> properties at a later time, such that the older one gets overwritten. The 
> current implementation in DatahubSyncTool does not handle this scenario. It 
> fails to update the status flag to removed:false during ingest, which means 
> the entity won't surface in the Datahub UI at all.
> Ref: See sections on Soft Delete and Hard Delete in the Datahub docs: 
> [https://datahubproject.io/docs/how/delete-metadata/#soft-delete-the-default]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4994) DatahubSyncTool should set "removed" status of an entity to false when updating it

2022-10-07 Thread Pramod Biligiri (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pramod Biligiri updated HUDI-4994:
--
Description: 
When DatahubSyncTool updates an entity in Datahub using an UPSERT request of 
their RestEmitter client, it can be assumed that the entity is no longer 
considered deleted, and needs to be discoverable henceforth in the Datahub UI.

For that, it is necessary to explicitly set the "status" metadata aspect of the 
entity to "{'removed':false}". This will handle the situation where the entity 
may have been (soft) deleted in the past. The addition of this "removed:false" 
for "status" aspect has no impact on newly created entities, or hard-deleted 
entities (of which no trace remains anyway).

Ref: See sections on Soft Delete and Hard Delete in the Datahub docs: 
https://datahubproject.io/docs/how/delete-metadata/#soft-delete-the-default

  was:
When DatahubSyncTool updates an entity in Datahub using an UPSERT request of 
their RestEmitter client, it can be assumed that the entity is no longer 
considered deleted, and needs to be discoverable henceforth in the Datahub UI.

For that, it is necessary to explicitly set the "status" metadata aspect of the 
entity to "{'removed':false}". This will handle the situation where the entity 
may have been (soft) deleted in the past. The addition of this "removed:false" 
for "status" aspect has no impact on newly created entities, or hard-deleted 
entities (of which no trace remains anyway).


> DatahubSyncTool should set "removed" status of an entity to false when 
> updating it
> --
>
> Key: HUDI-4994
> URL: https://issues.apache.org/jira/browse/HUDI-4994
> Project: Apache Hudi
>  Issue Type: Task
>  Components: meta-sync
>Reporter: Pramod Biligiri
>Priority: Major
>  Labels: pull-request-available
>
> When DatahubSyncTool updates an entity in Datahub using an UPSERT request of 
> their RestEmitter client, it can be assumed that the entity is no longer 
> considered deleted, and needs to be discoverable henceforth in the Datahub UI.
> For that, it is necessary to explicitly set the "status" metadata aspect of 
> the entity to "{'removed':false}". This will handle the situation where the 
> entity may have been (soft) deleted in the past. The addition of this 
> "removed:false" for "status" aspect has no impact on newly created entities, 
> or hard-deleted entities (of which no trace remains anyway).
> Ref: See sections on Soft Delete and Hard Delete in the Datahub docs: 
> https://datahubproject.io/docs/how/delete-metadata/#soft-delete-the-default



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4993) Allow specifying custom DataPlatform name and Dataset env in DatahubSyncTool

2022-10-07 Thread Pramod Biligiri (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pramod Biligiri updated HUDI-4993:
--
Component/s: meta-sync

> Allow specifying custom DataPlatform name and Dataset env in DatahubSyncTool
> 
>
> Key: HUDI-4993
> URL: https://issues.apache.org/jira/browse/HUDI-4993
> Project: Apache Hudi
>  Issue Type: Task
>  Components: meta-sync
>Reporter: Pramod Biligiri
>Priority: Major
>
> The name of the Datahub DataPlatform to use and the environment of the 
> Datahub Dataset (DEV/PROD, etc.) are currently hardcoded inside 
> HoodieDataHubDatasetIdentifier - 
> [https://github.com/apache/hudi/blob/release-0.12.0/hudi-sync/hudi-datahub-sync/src/main/java/org/apache/hudi/sync/datahub/config/HoodieDataHubDatasetIdentifier.java#L47-L49]
> Allow for these two to be customized.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-4994) DatahubSyncTool should set "removed" status of an entity to false when updating it

2022-10-07 Thread Pramod Biligiri (Jira)
Pramod Biligiri created HUDI-4994:
-

 Summary: DatahubSyncTool should set "removed" status of an entity 
to false when updating it
 Key: HUDI-4994
 URL: https://issues.apache.org/jira/browse/HUDI-4994
 Project: Apache Hudi
  Issue Type: Task
  Components: meta-sync
Reporter: Pramod Biligiri


When DatahubSyncTool updates an entity in Datahub using an UPSERT request of 
their RestEmitter client, it can be assumed that the entity is no longer 
considered deleted, and needs to be discoverable henceforth in the Datahub UI.

For that, it is necessary to explicitly set the "status" metadata aspect of the 
entity to "{'removed':false}". This will handle the situation where the entity 
may have been (soft) deleted in the past. The addition of this "removed:false" 
for "status" aspect has no impact on newly created entities, or hard-deleted 
entities (of which no trace remains anyway).
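
For illustration, a hedged sketch of what that could look like with Datahub's 
Java emitter (datahub-client); the builder and class names should be checked 
against the client version in use, and the URN and server below are 
placeholders:

    import com.linkedin.common.Status;
    import datahub.client.rest.RestEmitter;
    import datahub.event.MetadataChangeProposalWrapper;

    public class DatahubStatusUpsert {
        public static void main(String[] args) throws Exception {
            // Upsert the "status" aspect with removed=false so a previously
            // soft-deleted dataset becomes discoverable again in the UI.
            MetadataChangeProposalWrapper mcpw = MetadataChangeProposalWrapper.builder()
                    .entityType("dataset")
                    .entityUrn("urn:li:dataset:(urn:li:dataPlatform:hudi,db.table,PROD)")
                    .upsert()
                    .aspect(new Status().setRemoved(false))
                    .build();
            RestEmitter emitter = RestEmitter.create(b -> b.server("http://localhost:8080"));
            emitter.emit(mcpw, null).get(); // blocks until the write is acknowledged
        }
    }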



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-4993) Allow specifying custom DataPlatform name and Dataset env in DatahubSyncTool

2022-10-07 Thread Pramod Biligiri (Jira)
Pramod Biligiri created HUDI-4993:
-

 Summary: Allow specifying custom DataPlatform name and Dataset env 
in DatahubSyncTool
 Key: HUDI-4993
 URL: https://issues.apache.org/jira/browse/HUDI-4993
 Project: Apache Hudi
  Issue Type: Task
Reporter: Pramod Biligiri


The name of the Datahub DataPlatform to use and the environment of the Datahub 
Dataset (DEV/PROD, etc.) are currently hardcoded inside 
HoodieDataHubDatasetIdentifier - 
[https://github.com/apache/hudi/blob/release-0.12.0/hudi-sync/hudi-datahub-sync/src/main/java/org/apache/hudi/sync/datahub/config/HoodieDataHubDatasetIdentifier.java#L47-L49]

Allow for these two to be customized.
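
For illustration, the two values could be read from the sync config instead of 
the hardcoded constants; the property keys below are hypothetical, made up for 
this sketch:

    import org.apache.hudi.common.config.TypedProperties;

    class DatahubSyncConfigSketch {
        // The property keys are illustrative only, not real Hudi configs.
        static String platformName(TypedProperties props) {
            return props.getString("hoodie.meta.sync.datahub.dataplatform.name", "hudi");
        }

        static String datasetEnv(TypedProperties props) {
            return props.getString("hoodie.meta.sync.datahub.dataset.env", "PROD");
        }
    }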



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-4931) Explore fat jar option for gcs-connector lib used during GCS Ingestion

2022-09-28 Thread Pramod Biligiri (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-4931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610422#comment-17610422
 ] 

Pramod Biligiri commented on HUDI-4931:
---

Some useful references regarding this:
- GCP docs on Cloud Storage connector: 
[https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage]
- Hudi docs on GCS connectivity: https://hudi.apache.org/docs/gcs_hoodie/

 

> Explore fat jar option for gcs-connector lib used during GCS Ingestion
> --
>
> Key: HUDI-4931
> URL: https://issues.apache.org/jira/browse/HUDI-4931
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Pramod Biligiri
>Priority: Major
>
> Currently, the GCS Ingestion (HUDI-4850) expects recent versions of Jars like 
> protobuf and Guava to be provided to spark-submit explicitly, to override 
> older versions shipped with Spark. These Jars are used by the gcs-connector 
> which is a library from Google that helps connect to GCS. For more details 
> see 
> [https://docs.google.com/document/d/1VfvtdvhXw6oEHPgZ_4Be2rkPxIzE0kBCNUiVDsXnSAA/edit#]
>  (section titled "Configure Spark to use newer versions of some Jars").
> See if it's possible to create a shaded+fat jar of gcs-connector for this use 
> case instead, and avoid specifying things to spark-submit on the command line.
> An alternate approach to consider for the long term is HUDI-4930 (slim 
> bundles).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-4931) Explore fat jar option for gcs-connector lib used during GCS Ingestion

2022-09-27 Thread Pramod Biligiri (Jira)
Pramod Biligiri created HUDI-4931:
-

 Summary: Explore fat jar option for gcs-connector lib used during 
GCS Ingestion
 Key: HUDI-4931
 URL: https://issues.apache.org/jira/browse/HUDI-4931
 Project: Apache Hudi
  Issue Type: Task
Reporter: Pramod Biligiri


Currently, the GCS Ingestion (HUDI-4850) expects recent versions of JARs such as 
protobuf and Guava to be provided to spark-submit explicitly, to override older 
versions shipped with Spark. These JARs are used by gcs-connector, a library 
from Google that helps connect to GCS. For more details see 
[https://docs.google.com/document/d/1VfvtdvhXw6oEHPgZ_4Be2rkPxIzE0kBCNUiVDsXnSAA/edit#]
 (section titled "Configure Spark to use newer versions of some Jars").

See if it's possible to create a shaded+fat jar of gcs-connector for this use 
case instead, and avoid specifying things to spark-submit on the command line.

An alternate approach to consider for the long term is HUDI-4930 (slim bundles).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-4930) Create a bundle with all GCS deps that works with utilities-slim and engine bundle (spark or flink)

2022-09-27 Thread Pramod Biligiri (Jira)
Pramod Biligiri created HUDI-4930:
-

 Summary: Create a bundle with all GCS deps that works with 
utilities-slim and engine bundle (spark or flink)
 Key: HUDI-4930
 URL: https://issues.apache.org/jira/browse/HUDI-4930
 Project: Apache Hudi
  Issue Type: Task
Reporter: Pramod Biligiri


Currently, GCS deps are declared explicitly in the hudi-utilities POM and 
passed explicitly when invoking GCS Ingestion (a fat jar is not used).

Instead, create a bundle with all GCS deps that works with utilities-slim and 
an engine bundle (spark or flink).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-4929) Refactor code that is common to all ingestions from cloud sources

2022-09-27 Thread Pramod Biligiri (Jira)
Pramod Biligiri created HUDI-4929:
-

 Summary: Refactor code that is common to all ingestions from cloud 
sources
 Key: HUDI-4929
 URL: https://issues.apache.org/jira/browse/HUDI-4929
 Project: Apache Hudi
  Issue Type: Task
Reporter: Pramod Biligiri


Currently, there are features to ingest incrementally from S3 (HUDI-1897) and 
GCS (HUDI-4850). Refactor the logic common to both; this will make it easier 
to implement future cloud-based ingestion sources.
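
For illustration, one possible shared abstraction (a hypothetical sketch; the 
names here are made up for illustration, not taken from the Hudi codebase):

    import java.util.List;

    // Both the S3 and GCS sources resolve "which objects changed since the
    // last checkpoint"; a common interface would let each cloud plug in only
    // its own event-fetching logic.
    interface CloudStorageChangeSelector {
        List<String> changedObjectPaths(String lastCheckpoint);
    }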



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-4928) Use common configs for ingestion from S3, GCS etc

2022-09-27 Thread Pramod Biligiri (Jira)
Pramod Biligiri created HUDI-4928:
-

 Summary: Use common configs for ingestion from S3, GCS etc
 Key: HUDI-4928
 URL: https://issues.apache.org/jira/browse/HUDI-4928
 Project: Apache Hudi
  Issue Type: Task
Reporter: Pramod Biligiri


Currently, incremental ingestion is supported from S3 (HUDI-1897) and GCS 
(HUDI-4850). Normalize the config params that are common to both.

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-4927) GCS Ingestion supports only new file uploads, no deletion and repeated uploads

2022-09-27 Thread Pramod Biligiri (Jira)
Pramod Biligiri created HUDI-4927:
-

 Summary: GCS Ingestion supports only new file uploads, no deletion 
and repeated uploads
 Key: HUDI-4927
 URL: https://issues.apache.org/jira/browse/HUDI-4927
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Pramod Biligiri


The GCS Ingestion (https://issues.apache.org/jira/browse/HUDI-4850) supports 
only events related to new files being uploaded for the first time. 
Specifically, it does not detect files being deleted, or the same file being 
uploaded repeatedly.

GCS even has a notion of Object Versioning, which is also not supported.
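
For context, a hedged sketch of how those events could be distinguished: GCS 
Pub/Sub notifications carry an "eventType" attribute (values such as 
OBJECT_FINALIZE and OBJECT_DELETE per the GCS docs); the handler below is 
illustrative, not the actual source code:

    import java.util.Map;

    class GcsEventRouter {
        // Branch on the notification type instead of treating every event
        // as a first-time upload.
        static void route(Map<String, String> attributes) {
            switch (attributes.getOrDefault("eventType", "")) {
                case "OBJECT_FINALIZE":
                    // new object, or a repeated upload / new version
                    break;
                case "OBJECT_DELETE":
                    // object (or one of its versions) was deleted
                    break;
                default:
                    break;
            }
        }
    }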



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-4850) Implement DeltaStreamer Source for Google Cloud Storage

2022-09-15 Thread Pramod Biligiri (Jira)
Pramod Biligiri created HUDI-4850:
-

 Summary: Implement DeltaStreamer Source for Google Cloud Storage
 Key: HUDI-4850
 URL: https://issues.apache.org/jira/browse/HUDI-4850
 Project: Apache Hudi
  Issue Type: Task
  Components: deltastreamer
Reporter: Pramod Biligiri
 Fix For: 0.13.0


It should be possible to reliably ingest data from GCS buckets into Hudi using 
a Deltastreamer Source. Such a feature already exists to ingest from AWS S3 
buckets, as discussed in HUDI-1897 and described in a Hudi blog post: 
https://hudi.apache.org/blog/2021/08/23/s3-events-source/



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-4819) run_sync_tool.sh in hudi-hive-sync fails with classpath errors on release-0.12.0

2022-09-08 Thread Pramod Biligiri (Jira)
Pramod Biligiri created HUDI-4819:
-

 Summary: run_sync_tool.sh in hudi-hive-sync fails with classpath 
errors on release-0.12.0
 Key: HUDI-4819
 URL: https://issues.apache.org/jira/browse/HUDI-4819
 Project: Apache Hudi
  Issue Type: Bug
  Components: hive, meta-sync
Affects Versions: 0.12.0
Reporter: Pramod Biligiri
 Attachments: modified_run_sync_tool.sh

I ran the run_sync_tool.sh script after git cloning and building a new instance 
of apache-hudi (branch: release-0.12.0). The script failed with 
classpath-related errors. Find below the relevant sequence of commands I used:

$ git branch
* (HEAD detached at release-0.12.0)

$ mvn -Dspark3.2 -Dscala-2.12 -DskipTests  -Dcheckstyle.skip -Drat.skip clean 
install

$ echo $HADOOP_HOME
/home/pramod/2installers/hadoop-2.7.4

$ echo $HIVE_HOME
/home/pramod/2installers/apache-hive-3.1.3-bin

$ ./run_sync_tool.sh  --jdbc-url jdbc:hive2://hiveserver:1 
--partitioned-by bucket --base-path 
/2-pramod/tmp/gcs-integration-test/data/meta-gcs --database default --table 
gcs_meta_hive_4 > log.out 2>&1
setting hadoop conf dir
Running Command : java -cp 
/home/pramod/2installers/apache-hive-3.1.3-bin/lib/hive-metastore-3.1.3.jar::/home/pramod/2installers/apache-hive-3.1.3-bin/lib/hive-service-3.1.3.jar::/home/pramod/2installers/apache-hive-3.1.3-bin/lib/hive-exec-3.1.3.jar::/home/pramod/2installers/apache-hive-3.1.3-bin/lib/hive-jdbc-3.1.3.jar:/home/pramod/2installers/apache-hive-3.1.3-bin/lib/hive-jdbc-handler-3.1.3.jar::/home/pramod/2installers/apache-hive-3.1.3-bin/lib/jackson-annotations-2.12.0.jar:/home/pramod/2installers/apache-hive-3.1.3-bin/lib/jackson-core-2.12.0.jar:/home/pramod/2installers/apache-hive-3.1.3-bin/lib/jackson-core-asl-1.9.13.jar:/home/pramod/2installers/apache-hive-3.1.3-bin/lib/jackson-databind-2.12.0.jar:/home/pramod/2installers/apache-hive-3.1.3-bin/lib/jackson-dataformat-smile-2.12.0.jar:/home/pramod/2installers/apache-hive-3.1.3-bin/lib/jackson-mapper-asl-1.9.13.jar:/home/pramod/2installers/apache-hive-3.1.3-bin/lib/jackson-module-scala_2.11-2.12.0.jar::/home/pramod/2installers/hadoop-2.7.4/share/hadoop/common/*:/home/pramod/2installers/hadoop-2.7.4/share/hadoop/mapreduce/*:/home/pramod/2installers/hadoop-2.7.4/share/hadoop/hdfs/*:/home/pramod/2installers/hadoop-2.7.4/share/hadoop/common/lib/*:/home/pramod/2installers/hadoop-2.7.4/share/hadoop/hdfs/lib/*:/home/pramod/2installers/hadoop-2.7.4/etc/hadoop:/3-pramod/3workspace/apache-hudi/hudi-sync/hudi-hive-sync/../../packaging/hudi-hive-sync-bundle/target/hudi-hive-sync-bundle-0.12.0.jar
 org.apache.hudi.hive.HiveSyncTool --jdbc-url jdbc:hive2://hiveserver:1 
--partitioned-by bucket --base-path 
/2-pramod/tmp/gcs-integration-test/data/meta-gcs --database default --table 
gcs_meta_hive_4
2022-09-08 10:53:24,335 INFO  [main] conf.HiveConf 
(HiveConf.java:findConfigFile(187)) - Found configuration file 
file:/home/pramod/2installers/apache-hive-3.1.3-bin/conf/hive-site.xml
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by 
org.apache.hadoop.security.authentication.util.KerberosUtil 
(file:/2-pramod/installers/hadoop-2.7.4/share/hadoop/common/lib/hadoop-auth-2.7.4.jar)
 to method sun.security.krb5.Config.getInstance()
WARNING: Please consider reporting this to the maintainers of 
org.apache.hadoop.security.authentication.util.KerberosUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal 
reflective access operations
WARNING: All illegal access operations will be denied in a future release
2022-09-08 10:53:25,876 WARN  [main] util.NativeCodeLoader 
(NativeCodeLoader.java:(62)) - Unable to load native-hadoop library for 
your platform... using builtin-java classes where applicable
2022-09-08 10:53:26,359 INFO  [main] table.HoodieTableMetaClient 
(HoodieTableMetaClient.java:(121)) - Loading HoodieTableMetaClient from 
/2-pramod/tmp/gcs-integration-test/data/meta-gcs
2022-09-08 10:53:26,568 INFO  [main] table.HoodieTableConfig 
(HoodieTableConfig.java:(243)) - Loading table properties from 
/2-pramod/tmp/gcs-integration-test/data/meta-gcs/.hoodie/hoodie.properties
2022-09-08 10:53:26,585 INFO  [main] table.HoodieTableMetaClient 
(HoodieTableMetaClient.java:(140)) - Finished Loading Table of type 
COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from 
/2-pramod/tmp/gcs-integration-test/data/meta-gcs
2022-09-08 10:53:26,586 INFO  [main] table.HoodieTableMetaClient 
(HoodieTableMetaClient.java:(143)) - Loading Active commit timeline for 
/2-pramod/tmp/gcs-integration-test/data/meta-gcs
2022-09-08 10:53:26,727 INFO  [main] timeline.HoodieActiveTimeline 
(HoodieActiveTimeline.java:(129)) - Loaded instants upto : 
Option{val=[20220907220948700__commit__COMPLETED]}
Exception in thread "main" java.lang.NoClassDefFoundError: 
org/apache/http/config/Lookup
    at org.apache.hive.jdbc.HiveDriver.conn