[GitHub] [hudi] hudi-bot commented on pull request #7559: [HUDI-5447] Adding read support for record level index in metadata table

2023-01-03 Thread GitBox


hudi-bot commented on PR #7559:
URL: https://github.com/apache/hudi/pull/7559#issuecomment-1370375541

   
   ## CI report:
   
   * bdb5a418d12df6413bf94f3ff5149224139d6892 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14086)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] yihua commented on issue #7539: [SUPPORT]IllegalStateException: Trying to access closed classloader

2023-01-03 Thread GitBox


yihua commented on issue #7539:
URL: https://github.com/apache/hudi/issues/7539#issuecomment-1370374380

   @hbgstc123 Thanks for raising the issue.  @danny0405 could you provide help 
here?





[GitHub] [hudi] yihua commented on issue #7541: [SUPPORT] It's very slow to savepoint a table which has many (75k) partitions

2023-01-03 Thread GitBox


yihua commented on issue #7541:
URL: https://github.com/apache/hudi/issues/7541#issuecomment-1370374025

   Hi @haoxie-aws Thanks for raising this issue.  I confirm that the issue is 
due to unnecessary scanning of the metadata table when the number of partitions 
is large.  When the metadata table is enabled, the savepoint operation scans 
the metadata table once per partition, which leads to a lot of S3 requests.  
I'm working on a fix: HUDI-5485.





[jira] [Updated] (HUDI-5486) Update 0.12.x release notes with Long Term Support

2023-01-03 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-5486:

Story Points: 0

> Update 0.12.x release notes with Long Term Support 
> ---
>
> Key: HUDI-5486
> URL: https://issues.apache.org/jira/browse/HUDI-5486
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>






[jira] [Updated] (HUDI-5485) Improve performance of savepoint with MDT

2023-01-03 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-5485:

Story Points: 3

> Improve performance of savepoint with MDT
> -
>
> Key: HUDI-5485
> URL: https://issues.apache.org/jira/browse/HUDI-5485
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Critical
> Fix For: 0.13.0
>
>
> [https://github.com/apache/hudi/issues/7541]
> When metadata table is enabled, the savepoint operation is slow for a large 
> number of partitions (e.g., 75k).  The root cause is that for each partition, 
> the metadata table is scanned, which is unnecessary.





[GitHub] [hudi] hudi-bot commented on pull request #7559: [HUDI-5447] Adding read support for record level index in metadata table

2023-01-03 Thread GitBox


hudi-bot commented on PR #7559:
URL: https://github.com/apache/hudi/pull/7559#issuecomment-1370372976

   
   ## CI report:
   
   * 6857bb863668c8a6e83755a5d8a27812425ab586 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14063)
 
   * bdb5a418d12df6413bf94f3ff5149224139d6892 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] yihua commented on issue #7565: [SUPPORT] Memory Exception when building BuildProfile

2023-01-03 Thread GitBox


yihua commented on issue #7565:
URL: https://github.com/apache/hudi/issues/7565#issuecomment-1370371179

   Hi @jomach Thanks for raising the issue.  If you haven't, please check out 
the [Tuning Guide](https://hudi.apache.org/docs/tuning-guide/) for writing data 
to a Hudi table through a Spark job.
   
   We'll revisit the logic of the code snippet you pasted.  Usually, the number 
of combinations of partition path and record location (parquet file path) 
should be limited.





[GitHub] [hudi] hudi-bot commented on pull request #7597: [HUDI-5192] Prevent GH actions from running on trivial file changes

2023-01-03 Thread GitBox


hudi-bot commented on PR #7597:
URL: https://github.com/apache/hudi/pull/7597#issuecomment-1370370194

   
   ## CI report:
   
   * 9bdf2992ab2664d81eba0136b687949a7afdaa0d Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14085)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] yihua commented on issue #7577: [SUPPORT]

2023-01-03 Thread GitBox


yihua commented on issue #7577:
URL: https://github.com/apache/hudi/issues/7577#issuecomment-1370366726

   Hi @Shagish thanks for raising this issue.  Could you share the write 
configs of your Hudi Spark job?  One possibility is that the metadata table 
might be out of sync with the data table for MOR with async table services in 
the 0.11.0 release.  To unblock the pipeline, you can try disabling the 
metadata table with `hoodie.metadata.enable=false` or try the latest 0.12.2 
release to see if the problem goes away.
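   For reference, a minimal PySpark sketch of passing this option on a write, 
assuming an active SparkSession and a DataFrame `df`; the table name and base 
path are placeholders:
   
   ```python
   # Hypothetical Hudi write with the metadata table disabled while debugging;
   # "my_table" and the save path are placeholders for the actual job's values.
   (df.write.format("hudi")
       .option("hoodie.table.name", "my_table")
       .option("hoodie.metadata.enable", "false")  # temporary workaround
       .mode("append")
       .save("s3://bucket/path/my_table"))
   ```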





[jira] [Assigned] (HUDI-5293) Schema on read + reconcile schema fails w/ 0.12.1

2023-01-03 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-5293:
-

Assignee: Jonathan Vexler

> Schema on read + reconcile schema fails w/ 0.12.1
> -
>
> Key: HUDI-5293
> URL: https://issues.apache.org/jira/browse/HUDI-5293
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: sivabalan narayanan
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> If I do schema on read on commit1 and then schema on read + reconcile schema 
> for the 2nd batch, it fails with: 
> {code:java}
> warning: there was one deprecation warning; re-run with -deprecation for 
> details
> 22/11/28 16:44:26 ERROR BaseSparkCommitActionExecutor: Error upserting 
> bucketType UPDATE for partition :2
> java.lang.IllegalArgumentException: cannot modify hudi meta col: 
> _hoodie_commit_time
>   at 
> org.apache.hudi.internal.schema.action.TableChange$BaseColumnChange.checkColModifyIsLegal(TableChange.java:157)
>   at 
> org.apache.hudi.internal.schema.action.TableChanges$ColumnAddChange.addColumns(TableChanges.java:314)
>   at 
> org.apache.hudi.internal.schema.utils.AvroSchemaEvolutionUtils.lambda$reconcileSchema$5(AvroSchemaEvolutionUtils.java:92)
>   at 
> java.util.TreeMap$EntrySpliterator.forEachRemaining(TreeMap.java:2969)
>   at 
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580)
>   at 
> org.apache.hudi.internal.schema.utils.AvroSchemaEvolutionUtils.reconcileSchema(AvroSchemaEvolutionUtils.java:80)
>   at 
> org.apache.hudi.table.action.commit.HoodieMergeHelper.runMerge(HoodieMergeHelper.java:103)
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdateInternal(BaseSparkCommitActionExecutor.java:358)
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdate(BaseSparkCommitActionExecutor.java:349)
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:322)
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.lambda$mapPartitionsAsRDD$a3ab3c4$1(BaseSparkCommitActionExecutor.java:244)
>   at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
>   at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:875)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:875)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:359)
>   at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:357)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:357)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:308)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:123)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748) {code}

[jira] [Assigned] (HUDI-5356) Call close on SparkRDDWriteClient several places

2023-01-03 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-5356:
-

Assignee: Jonathan Vexler

> Call close on SparkRDDWriteClient several places
> 
>
> Key: HUDI-5356
> URL: https://issues.apache.org/jira/browse/HUDI-5356
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: dzcxzl
>Assignee: Jonathan Vexler
>Priority: Minor
>  Labels: pull-request-available
>






[jira] [Assigned] (HUDI-5349) Clean up partially failed restore if any

2023-01-03 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-5349:
-

Assignee: Jonathan Vexler  (was: sivabalan narayanan)

> Clean up partially failed restore if any
> 
>
> Key: HUDI-5349
> URL: https://issues.apache.org/jira/browse/HUDI-5349
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: sivabalan narayanan
>Assignee: Jonathan Vexler
>Priority: Blocker
> Fix For: 0.13.0
>
>
> If a "restore" operation was attempted on a table and failed mid-way, the 
> partial restore could still be lying around. When re-attempted, a new instant 
> time will be allotted and the restore re-attempted from scratch, but this may 
> thwart compaction progression in the MDT. So we need to ensure that for a 
> given savepoint, we always re-use the existing restore instant, if any. 





[jira] [Closed] (HUDI-5370) Properly close file handles for Metadata writer

2023-01-03 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-5370.
-
Resolution: Fixed

> Properly close file handles for Metadata writer
> ---
>
> Key: HUDI-5370
> URL: https://issues.apache.org/jira/browse/HUDI-5370
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Timothy Brown
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.12.2
>
>






[jira] [Assigned] (HUDI-5361) Propagate Hudi properties set in Spark's SQLConf to Hudi

2023-01-03 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-5361:
-

Assignee: Jonathan Vexler  (was: Alexey Kudinkin)

> Propagate Hudi properties set in Spark's SQLConf to Hudi
> 
>
> Key: HUDI-5361
> URL: https://issues.apache.org/jira/browse/HUDI-5361
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: Alexey Kudinkin
>Assignee: Jonathan Vexler
>Priority: Critical
> Fix For: 0.13.0
>
>
> Currently, the only property we propagate from Spark's SQLConf is 
> hoodie.metadata.enable.
> Instead, we should pull all of the Hudi-related configs from SQLConf 
> and pass them to Hudi.
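A minimal PySpark sketch of the two cases, assuming an active SparkSession; the 
second key is just one example of a Hudi config that is not picked up today:

```python
# Hypothetical illustration of setting Hudi configs through Spark's SQLConf.
spark.conf.set("hoodie.metadata.enable", "false")              # propagated to Hudi today
spark.conf.set("hoodie.datasource.write.operation", "insert")  # currently not propagated
```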





[jira] [Assigned] (HUDI-5322) Bulk-insert (row-writing) is not rewriting incoming dataset into Writer's schema

2023-01-03 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-5322:
-

Assignee: Jonathan Vexler

> Bulk-insert (row-writing) is not rewriting incoming dataset into Writer's 
> schema
> 
>
> Key: HUDI-5322
> URL: https://issues.apache.org/jira/browse/HUDI-5322
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Reporter: Alexey Kudinkin
>Assignee: Jonathan Vexler
>Priority: Critical
> Fix For: 0.13.0
>
>
> Row-writing Bulk-insert has to rewrite the incoming dataset into the 
> finalized Writer's schema; instead, it's currently just using the incoming 
> dataset as is, deviating in semantics from the non-Row-writing flow (as well 
> as other operations)





[jira] [Assigned] (HUDI-5372) Fix NPE caused by alter table add column

2023-01-03 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-5372:
-

Assignee: Jonathan Vexler

> Fix NPE caused by alter table add column
> 
>
> Key: HUDI-5372
> URL: https://issues.apache.org/jira/browse/HUDI-5372
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Sagar Sumit
>Assignee: Jonathan Vexler
>Priority: Critical
>  Labels: pull-request-available
>






[jira] [Updated] (HUDI-5370) Properly close file handles for Metadata writer

2023-01-03 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-5370:
--
Fix Version/s: 0.12.2

> Properly close file handles for Metadata writer
> ---
>
> Key: HUDI-5370
> URL: https://issues.apache.org/jira/browse/HUDI-5370
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Timothy Brown
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.12.2
>
>






[jira] [Updated] (HUDI-5375) Fix re-using of file readers w/ metadata table in FileIndex

2023-01-03 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-5375:
--
Fix Version/s: 0.12.2

> Fix re-using of file readers w/ metadata table in FileIndex
> ---
>
> Key: HUDI-5375
> URL: https://issues.apache.org/jira/browse/HUDI-5375
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Affects Versions: 0.12.1
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.2
>
>
> In 0.12.1, it looks like we had a bug where re-use of file readers for the 
> metadata table was set to false. 
> [https://github.com/apache/hudi/blob/a5978cd2308f0f2e501e12040f1fafae8afb86e9/hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java#L272]
>  
> It was already fixed in master in 
> [https://github.com/apache/hudi/pull/6680/files#diff-0115903d86444e5decbeac54ac873d08cbf4175268e6c1b7fe3dc59591f73970,]
>  but we are not pulling that patch into 0.12.2 since it's a large patch. So 
> we need a focused fix for the 0.12.2 branch. 
>  





[jira] [Closed] (HUDI-5375) Fix re-using of file readers w/ metadata table in FileIndex

2023-01-03 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-5375.
-
Resolution: Fixed

> Fix re-using of file readers w/ metadata table in FileIndex
> ---
>
> Key: HUDI-5375
> URL: https://issues.apache.org/jira/browse/HUDI-5375
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Affects Versions: 0.12.1
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.2
>
>
> In 0.12.1, it looks like we had a bug where re-use of file readers for the 
> metadata table was set to false. 
> [https://github.com/apache/hudi/blob/a5978cd2308f0f2e501e12040f1fafae8afb86e9/hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java#L272]
>  
> It was already fixed in master in 
> [https://github.com/apache/hudi/pull/6680/files#diff-0115903d86444e5decbeac54ac873d08cbf4175268e6c1b7fe3dc59591f73970,]
>  but we are not pulling that patch into 0.12.2 since it's a large patch. So 
> we need a focused fix for the 0.12.2 branch. 
>  





[jira] [Closed] (HUDI-5383) Test 0.12.2 release branch

2023-01-03 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-5383.
-
  Assignee: sivabalan narayanan
Resolution: Fixed

> Test 0.12.2 release branch 
> ---
>
> Key: HUDI-5383
> URL: https://issues.apache.org/jira/browse/HUDI-5383
> Project: Apache Hudi
>  Issue Type: Test
>  Components: tests-ci
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.2
>
>






[jira] [Assigned] (HUDI-5386) Cleaning conflicts in occ mode

2023-01-03 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-5386:
-

Assignee: Jonathan Vexler

> Cleaning conflicts in occ mode
> --
>
> Key: HUDI-5386
> URL: https://issues.apache.org/jira/browse/HUDI-5386
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: HunterXHunter
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2022-12-14-11-26-21-995.png, 
> image-2022-12-14-11-26-37-252.png
>
>
> {code:java}
> configuration parameter: 
> 'hoodie.cleaner.policy.failed.writes' = 'LAZY'
> 'hoodie.write.concurrency.mode' = 'optimistic_concurrency_control' {code}
> Because `getInstantsToRollback` is not locked, multiple writers can get the 
> same `instantsToRollback`, so the same `instant` will be deleted multiple 
> times and the same `rollback.inflight` will be created multiple times.





[jira] [Updated] (HUDI-5383) Test 0.12.2 release branch

2023-01-03 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-5383:
--
Fix Version/s: 0.12.2

> Test 0.12.2 release branch 
> ---
>
> Key: HUDI-5383
> URL: https://issues.apache.org/jira/browse/HUDI-5383
> Project: Apache Hudi
>  Issue Type: Test
>  Components: tests-ci
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.2
>
>






[jira] [Assigned] (HUDI-5461) Upsert after renaming the table fails due to table props validation

2023-01-03 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-5461:
-

Assignee: Jonathan Vexler

> Upsert after renaming the table fails due to table props validation
> ---
>
> Key: HUDI-5461
> URL: https://issues.apache.org/jira/browse/HUDI-5461
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: configs, spark-sql
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>
> This happens because the table is renamed in the catalog but not in 
> hoodie.props
> Exception:
> {code:java}
> org.apache.hudi.exception.HoodieException: Config conflict(key  current value  existing value):
> hoodie.table.name:  table_renamed  table1
>   at 
> org.apache.hudi.HoodieWriterUtils$.validateTableConfig(HoodieWriterUtils.scala:167)
>   at 
> org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:90)
>   at 
> org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:145) {code}
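A hedged repro sketch of the failure mode, assuming spark-sql on a Hudi table; 
the table names follow the exception above and the INSERT is just one write 
path that exercises the validation:

```python
# Hypothetical reproduction: the rename updates the catalog, but
# hoodie.properties still records hoodie.table.name=table1, so the next
# write fails table-config validation with the conflict shown above.
spark.sql("ALTER TABLE table1 RENAME TO table_renamed")
spark.sql("INSERT INTO table_renamed VALUES (1, 'a')")  # triggers the validation
```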





[jira] [Assigned] (HUDI-5462) Spark-sql certain commands are only allowed with v2 tables

2023-01-03 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-5462:
-

Assignee: Jonathan Vexler

> Spark-sql certain commands are only allowed with v2 tables
> --
>
> Key: HUDI-5462
> URL: https://issues.apache.org/jira/browse/HUDI-5462
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: docs, spark-sql
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>
> Certain commands such as DROP COLUMNS, RENAME COLUMN are mentioned in the 
> [spark 
> documentation|https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-alter-table.html]
>  but not in our documentation. We should add them to our documentation.
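A hedged PySpark sketch of the commands in question; `t` and the column names 
are placeholders, and on Hudi tables these DDLs may require schema evolution to 
be enabled:

```python
# Hypothetical spark-sql invocations of the ALTER TABLE commands named above.
spark.sql("ALTER TABLE t RENAME COLUMN old_name TO new_name")
spark.sql("ALTER TABLE t DROP COLUMNS (unused_col)")
```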





[jira] [Assigned] (HUDI-5457) Configuration documentation for hoodie.datasource.write.operation needs to be updated

2023-01-03 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-5457:
-

Assignee: Jonathan Vexler

> Configuration documentation for hoodie.datasource.write.operation needs to be 
> updated
> -
>
> Key: HUDI-5457
> URL: https://issues.apache.org/jira/browse/HUDI-5457
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: configs, dev-experience, docs
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Minor
>
> The current description of hoodie.datasource.write.operation uses both 
> "bulkinsert" and "bulk insert". The actual bulk insert value is 
> "BULK_INSERT". It also seems like there are more options than just those 
> three. Additionally, the grammar needs improvement.





[jira] [Updated] (HUDI-31) MOR - Allow partitioner to pick more than one small file for inserting new data #494

2023-01-03 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-31?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-31:

Fix Version/s: 0.11.0

> MOR - Allow partitioner to pick more than one small file for inserting new 
> data #494
> 
>
> Key: HUDI-31
> URL: https://issues.apache.org/jira/browse/HUDI-31
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: Vinoth Chandar
>Assignee: Pratyaksh Sharma
>Priority: Major
> Fix For: 0.11.0
>
>
> https://github.com/uber/hudi/issues/494





[jira] [Assigned] (HUDI-4755) INSERT_OVERWRITE(/TABLE) in spark sql should not fail time travel queries for older timestamps

2023-01-03 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-4755:
-

Assignee: Jonathan Vexler  (was: XiaoyuGeng)

> INSERT_OVERWRITE(/TABLE) in spark sql should not fail time travel queries for 
> older timestamps
> --
>
> Key: HUDI-4755
> URL: https://issues.apache.org/jira/browse/HUDI-4755
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql
>Reporter: sivabalan narayanan
>Assignee: Jonathan Vexler
>Priority: Critical
>  Labels: hudi-on-call
> Fix For: 0.13.0
>
>
> When INSERT_OVERWRITE or INSERT_OVERWRITE_TABLE is used in spark-sql, we 
> should still support time travel queries for older timestamps. 
>  
> Ref issue: https://github.com/apache/hudi/issues/6452





[jira] [Updated] (HUDI-31) MOR - Allow partitioner to pick more than one small file for inserting new data #494

2023-01-03 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-31?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-31:

Status: Open  (was: In Progress)

> MOR - Allow partitioner to pick more than one small file for inserting new 
> data #494
> 
>
> Key: HUDI-31
> URL: https://issues.apache.org/jira/browse/HUDI-31
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: Vinoth Chandar
>Assignee: Pratyaksh Sharma
>Priority: Major
> Fix For: 0.11.0
>
>
> https://github.com/uber/hudi/issues/494





[jira] [Closed] (HUDI-31) MOR - Allow partitioner to pick more than one small file for inserting new data #494

2023-01-03 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-31?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-31.
---
Resolution: Fixed

> MOR - Allow partitioner to pick more than one small file for inserting new 
> data #494
> 
>
> Key: HUDI-31
> URL: https://issues.apache.org/jira/browse/HUDI-31
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: Vinoth Chandar
>Assignee: Pratyaksh Sharma
>Priority: Major
> Fix For: 0.11.0
>
>
> https://github.com/uber/hudi/issues/494





[jira] [Assigned] (HUDI-5460) Spark-sql ALTER TABLE SET TBLPROPERTIES never fails

2023-01-03 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-5460:
-

Assignee: Jonathan Vexler

> Spark-sql ALTER TABLE SET TBLPROPERTIES never fails
> ---
>
> Key: HUDI-5460
> URL: https://issues.apache.org/jira/browse/HUDI-5460
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>
> When updating the tblproperties, the command never fails, even when some 
> properties are not actually changed. An error with a meaningful message 
> should be shown.
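A hedged illustration of the current behavior via spark-sql; the table and 
property names are placeholders:

```python
# Hypothetical: today this succeeds silently even when the property is not
# actually applied; the ask is a meaningful error instead.
spark.sql("ALTER TABLE t SET TBLPROPERTIES ('hoodie.some.property' = 'value')")
```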





[jira] [Updated] (HUDI-3775) Allow for offline compaction of MOR tables via spark streaming

2023-01-03 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-3775:
--
Sprint: 2022/09/05, 2023-01-09  (was: 2022/09/05)

> Allow for offline compaction of MOR tables via spark streaming
> --
>
> Key: HUDI-3775
> URL: https://issues.apache.org/jira/browse/HUDI-3775
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: compaction, spark
>Reporter: Rajesh
>Assignee: Jonathan Vexler
>Priority: Critical
>  Labels: easyfix
> Fix For: 0.13.0
>
>
> Currently there is no way to avoid compaction taking up a lot of resources 
> when run inline or async for MOR tables via Spark Streaming. Delta Streamer 
> has ways to assign resources between ingestion and async compaction, but 
> Spark Streaming does not have that option. 
> Introducing a flag to turn off automatic compaction and allowing users to run 
> compaction in a separate process will decouple both concerns.
> This will also allow the users to size the cluster just for ingestion and 
> deal with compaction separately, without blocking.  We will need to look into 
> documenting best practices for running offline compaction.
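A hedged sketch of the write-side knobs this could involve, assuming a PySpark 
structured-streaming sink; the table name, checkpoint location, and path are 
placeholders:

```python
# Hypothetical Spark streaming write to a MOR table with automatic compaction
# turned off, so compaction can run in a separate offline process.
(df.writeStream.format("hudi")
    .option("hoodie.table.name", "my_mor_table")                   # placeholder
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
    .option("hoodie.compact.inline", "false")                      # no inline compaction
    .option("hoodie.datasource.compaction.async.enable", "false")  # no async compaction here
    .option("checkpointLocation", "/tmp/hudi_checkpoint")          # placeholder
    .start("s3://bucket/path/my_mor_table"))                       # placeholder
```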





[jira] [Assigned] (HUDI-3775) Allow for offline compaction of MOR tables via spark streaming

2023-01-03 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-3775:
-

Assignee: Jonathan Vexler  (was: sivabalan narayanan)

> Allow for offline compaction of MOR tables via spark streaming
> --
>
> Key: HUDI-3775
> URL: https://issues.apache.org/jira/browse/HUDI-3775
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: compaction, spark
>Reporter: Rajesh
>Assignee: Jonathan Vexler
>Priority: Critical
>  Labels: easyfix
> Fix For: 0.13.0
>
>
> Currently there is no way to avoid compaction taking up a lot of resources 
> when run inline or async for MOR tables via Spark Streaming. Delta Streamer 
> has ways to assign resources between ingestion and async compaction, but 
> Spark Streaming does not have that option. 
> Introducing a flag to turn off automatic compaction and allowing users to run 
> compaction in a separate process will decouple both concerns.
> This will also allow the users to size the cluster just for ingestion and 
> deal with compaction separately, without blocking.  We will need to look into 
> documenting best practices for running offline compaction.





[GitHub] [hudi] yihua commented on issue #7589: Keep only clustered file(all) after cleaning

2023-01-03 Thread GitBox


yihua commented on issue #7589:
URL: https://github.com/apache/hudi/issues/7589#issuecomment-1370361242

   Hi @maheshguptags Thanks for the question.  To clarify, are you asking to 
keep the new parquet files after clustering, which replace the compacted file 
groups that have parquet files?  This should already be the case.  If not, 
could you provide reproducible steps to show the failure?





[GitHub] [hudi] yihua commented on issue #7590: Failed to rollback s3://s3_bucket/xml commits 20221231041647333

2023-01-03 Thread GitBox


yihua commented on issue #7590:
URL: https://github.com/apache/hudi/issues/7590#issuecomment-1370358130

   Hi @koochiswathiTR Thanks for raising the issue.  Could you share the Hudi 
write configs for this job?  It looks like the timeline server failed to 
start due to the underlying Javalin server: `java.lang.RuntimeException: 
java.io.IOException: Too many open files at 
io.javalin.Javalin.start(Javalin.java:189)`.   This might be a Javalin issue 
where the server cannot be started on a port.  Retrying the same job should 
get around this.
   
   Removing the commit file complained about may, in this case, lead to 
uncommitted data being served by queries, as such data has to be rolled back 
first.





[jira] [Updated] (HUDI-5463) Apply rollback commits from data table as rollbacks in MDT instead of Delta commit

2023-01-03 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-5463:
--
Sprint: 0.13.0 Final Sprint  (was: 2023-01-09)

> Apply rollback commits from data table as rollbacks in MDT instead of Delta 
> commit
> --
>
> Key: HUDI-5463
> URL: https://issues.apache.org/jira/browse/HUDI-5463
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Critical
> Fix For: 0.13.0
>
>
> As of now, any rollback in the DT is another DC in the MDT. This may not 
> scale for the record level index in the MDT, since we have to add 1000s of 
> delete records and finally have to resolve all valid and invalid records. 
> So, it's better to roll back the commit in the MDT as well instead of doing 
> a DC. 
>  
> Impact: 
> The record level index is unusable w/o this change. While fixing other 
> rollback related tickets, do consider this as a possible option if it 
> simplifies other fixes. 





[GitHub] [hudi] yihua commented on issue #7594: [SUPPORT] Hudi Time Travel from Athena

2023-01-03 Thread GitBox


yihua commented on issue #7594:
URL: https://github.com/apache/hudi/issues/7594#issuecomment-1370346300

   @umehrot2 @rahil-c do you folks have any information on this question?





[GitHub] [hudi] yihua commented on issue #7596: [SUPPORT] java.lang.NoSuchMethodException: org.apache.hudi.utilities.sources.AvroKafkaSource when running HoodieDeltaStreamer

2023-01-03 Thread GitBox


yihua commented on issue #7596:
URL: https://github.com/apache/hudi/issues/7596#issuecomment-1370343588

   Hi @afuyo thanks for reporting this issue.  This might be a configuration 
issue.  Have you tried the same Deltastreamer job in the [Docker 
Demo](https://hudi.apache.org/docs/docker_demo) to see if it can succeed? 





[GitHub] [hudi] maddy2u commented on issue #7594: [SUPPORT] Hudi Time Travel from Athena

2023-01-03 Thread GitBox


maddy2u commented on issue #7594:
URL: https://github.com/apache/hudi/issues/7594#issuecomment-1370299596

   Thanks for the information @tooptoop4 ! I will keep an eye on the PR for 
updates.





[GitHub] [hudi] hudi-bot commented on pull request #7597: [HUDI-5192] Prevent GH actions from running on trivial file changes

2023-01-03 Thread GitBox


hudi-bot commented on PR #7597:
URL: https://github.com/apache/hudi/pull/7597#issuecomment-1370269415

   
   ## CI report:
   
   * 9bdf2992ab2664d81eba0136b687949a7afdaa0d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14085)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #7597: [HUDI-5192] Prevent GH actions from running on trivial file changes

2023-01-03 Thread GitBox


hudi-bot commented on PR #7597:
URL: https://github.com/apache/hudi/pull/7597#issuecomment-1370264023

   
   ## CI report:
   
   * 9bdf2992ab2664d81eba0136b687949a7afdaa0d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] yihua commented on issue #7595: [SUPPORT] Hudi Clean and Delta commits taking ~50 mins to finish frequently

2023-01-03 Thread GitBox


yihua commented on issue #7595:
URL: https://github.com/apache/hudi/issues/7595#issuecomment-1370256718

   Hi @BalaMahesh Thanks for raising the issue.  To better triage this, could 
you provide more details about the Hudi table (partitioned or non-partitioned, 
and how many partitions if partitioned), the Spark driver logs, and screenshots 
of the stages which take a long time to finish?





[jira] [Updated] (HUDI-5192) GH actions and azure ci tests run even for trivial fixes

2023-01-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-5192:
-
Labels: pull-request-available  (was: )

> GH actions and azure ci tests run even for trivial fixes
> 
>
> Key: HUDI-5192
> URL: https://issues.apache.org/jira/browse/HUDI-5192
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Minor
>  Labels: pull-request-available
>
> PRs such as [https://github.com/apache/hudi/pull/7178] do not need to run GH 
> actions/Azure CI. It is not necessary and takes resources away from important 
> fixes.





[GitHub] [hudi] jonvex opened a new pull request, #7597: [HUDI-5192] Prevent GH actions from running on trivial file changes

2023-01-03 Thread GitBox


jonvex opened a new pull request, #7597:
URL: https://github.com/apache/hudi/pull/7597

   ### Change Logs
   
   If a PR only has changes to files with extensions of `bmp, gif, jpg, jpeg, 
md, pdf, png, svg`, do not run GH actions. 
   
   ### Impact
   
   Reduce waiting time before gh actions starts running
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   N/A
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[GitHub] [hudi] tooptoop4 commented on issue #7594: [SUPPORT] Hudi Time Travel from Athena

2023-01-03 Thread GitBox


tooptoop4 commented on issue #7594:
URL: https://github.com/apache/hudi/issues/7594#issuecomment-1370192167

   Not sure when. For the 2nd question, at least in Trino itself I think you 
can test that PR on a local patch build; I assume it will work as long as you 
define the right IAM permissions.





[GitHub] [hudi] maddy2u commented on issue #7594: [SUPPORT] Hudi Time Travel from Athena

2023-01-03 Thread GitBox


maddy2u commented on issue #7594:
URL: https://github.com/apache/hudi/issues/7594#issuecomment-1370181877

   @tooptoop4: Any expectation on when this will be merged into Trinodb? I 
think it takes a while before it is made available within Athena after that.
   
   
   Not sure if this is the right place to ask the question, but here it goes: 
is there a way today to enable time travel on Hudi datasets across AWS 
accounts? 
   





[jira] [Updated] (HUDI-5493) Revisit the archival process wrt clustering

2023-01-03 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-5493:

Description: 
[https://github.com/apache/hudi/pull/7568]

 

The above PR fixes the case where the archival of a clustering replacecommit 
can lead to duplicate data when both the replaced and new file groups from the 
replacecommit co-exist in the Hudi table.

 

The new logic is complex.  We need to simplify the archival process.

> Revisit the archival process wrt clustering
> ---
>
> Key: HUDI-5493
> URL: https://issues.apache.org/jira/browse/HUDI-5493
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Priority: Major
>
> [https://github.com/apache/hudi/pull/7568]
>  
> The above PR fixes the case where the archival of a clustering replacecommit 
> can lead to duplicate data when both the replaced and new file groups from 
> the replacecommit co-exist in the Hudi table.
>  
> The new logic is complex.  We need to simplify the archival process.





[jira] [Updated] (HUDI-5493) Revisit the archival process wrt clustering

2023-01-03 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-5493:

Fix Version/s: 0.14.0

> Revisit the archival process wrt clustering
> ---
>
> Key: HUDI-5493
> URL: https://issues.apache.org/jira/browse/HUDI-5493
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Priority: Major
> Fix For: 0.14.0
>
>
> [https://github.com/apache/hudi/pull/7568]
>  
> The above PR fixes the case where the archival of a clustering replacecommit 
> can lead to duplicate data when both the replaced and new file groups from 
> the replacecommit co-exist in the Hudi table.
>  
> The new logic is complex.  We need to simplify the archival process.





[jira] [Updated] (HUDI-5493) Revisit the archival process wrt clustering

2023-01-03 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-5493:

Priority: Critical  (was: Major)

> Revisit the archival process wrt clustering
> ---
>
> Key: HUDI-5493
> URL: https://issues.apache.org/jira/browse/HUDI-5493
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Priority: Critical
> Fix For: 0.14.0
>
>
> [https://github.com/apache/hudi/pull/7568]
>  
> The above PR fixes the case where the archival of a clustering replacecommit 
> can lead to duplicate data when both the replaced and new file groups from 
> the replacecommit co-exist in the Hudi table.
>  
> The new logic is complex.  We need to simplify the archival process.





[GitHub] [hudi] yihua commented on pull request #7568: [HUDI-5341] CleanPlanner retains earliest commits must not be later than earliest pending commit

2023-01-03 Thread GitBox


yihua commented on PR #7568:
URL: https://github.com/apache/hudi/pull/7568#issuecomment-1370168033

   [HUDI-5493](https://issues.apache.org/jira/browse/HUDI-5493) for revisiting 
the logic.





[jira] [Created] (HUDI-5493) Revisit the archival process wrt clustering

2023-01-03 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-5493:
---

 Summary: Revisit the archival process wrt clustering
 Key: HUDI-5493
 URL: https://issues.apache.org/jira/browse/HUDI-5493
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Ethan Guo








[GitHub] [hudi] soumilshah1995 commented on issue #7591: [SUPPORT] Kinesis Data Analytics Flink1.13 to HUDI

2023-01-03 Thread GitBox


soumilshah1995 commented on issue #7591:
URL: https://github.com/apache/hudi/issues/7591#issuecomment-1370164850

   Here are the details again; I tried again this morning.
   
    Please note: this time I am on US-WEST-2; previously I was trying on 
US-EAST-1.
   
    Kinesis Streams 
   
![image](https://user-images.githubusercontent.com/39345855/210422439-4c008e1e-ff85-421b-8fc5-75b8b635896d.png)
   
   ### Python Code to Dump Dummy Data 
   ```
   try:
       import datetime
       import json
       import random
       import boto3
       import os
       import uuid
       import time
       from faker import Faker

       from dotenv import load_dotenv
       load_dotenv(".env")
   except Exception as e:
       pass

   faker = Faker()  # kept from the original; unused by getReferrer()


   def getReferrer():
       # Build one dummy stock-tick record.
       data = {}
       now = datetime.datetime.now()  # datetime is imported as a module above
       data['uuid'] = str(uuid.uuid4())
       data['event_time'] = now.isoformat()
       data['ticker'] = random.choice(['AAPL', 'AMZN', 'MSFT', 'INTC', 'TBV'])
       data['price'] = round(random.random() * 100, 2)
       return data


   # Create the client once, outside the loop, instead of on every iteration.
   kinesis_client = boto3.client(
       'kinesis',
       region_name=os.getenv("DEV_AWS_REGION_NAME"),
       aws_access_key_id=os.getenv("DEV_ACCESS_KEY"),
       aws_secret_access_key=os.getenv("DEV_SECRET_KEY"))

   while True:
       data = json.dumps(getReferrer())
       print(data)
       res = kinesis_client.put_record(
           StreamName="stock-streams",
           Data=data,
           PartitionKey="1")
       time.sleep(3)
   
   
   ```
    KDA
   
   
![image](https://user-images.githubusercontent.com/39345855/210422526-8e61f16e-c457-4aea-8a2f-c59ec03172f7.png)
   
    Settings Added JAR Files 
   
![image](https://user-images.githubusercontent.com/39345855/210422789-a6f08f42-2ad9-46c6-a8d9-b05ba150a6ba.png)
   
   ```
   %flink.ssql(type=update)
   
   DROP TABLE if exists stock_table;
   
   CREATE TABLE stock_table (
   uuid varchar,
   ticker VARCHAR,
   price DOUBLE,
   event_time TIMESTAMP(3),
   WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
   )
   PARTITIONED BY (ticker)
   WITH (
   'connector' = 'kinesis',
   'stream' = 'stock-streams',
   'aws.region' = 'us-west-2',
   'scan.stream.initpos' = 'LATEST',
   'format' = 'json',
   'json.timestamp-format.standard' = 'ISO-8601'
   );
   ```
   
![image](https://user-images.githubusercontent.com/39345855/210428825-c54b6e4b-c262-409d-ae25-ccb134c0b011.png)
   
   ```
   %flink.ssql(type=update)
   
   DROP TABLE if exists stock_table_hudi;
   
   CREATE TABLE stock_table_hudi(
   uuid varchar,
   ticker VARCHAR,
   price DOUBLE,
   event_time TIMESTAMP(3)
   )
   WITH (
 'connector' = 'hudi',
 'path' = 's3://soumil-dms-learn',
  'table.type' = 'MERGE_ON_READ' -- creates a MERGE_ON_READ table; the default is COPY_ON_WRITE
   );
   ```
   
![image](https://user-images.githubusercontent.com/39345855/210429264-f131e2ab-14ee-47a9-beff-a98c8c598b3f.png)
   
   # Real Time Data 
   
![image](https://user-images.githubusercontent.com/39345855/210429752-6baaff85-eaf6-44b2-bb9e-cdf68b34ac06.png)
   
   
   ## Error Messages Same as above
   
   ```
   ConnectException: Connection refused
   java.io.IOException: org.apache.flink.util.FlinkException: Failed to execute 
job 'INSERT INTO stock_table_hudi 
   SELECT  uuid, ticker, price, event_time as ts from stock_table'.
at 
org.apache.zeppelin.flink.FlinkSqlInterrpeter.callInsertInto(FlinkSqlInterrpeter.java:538)
at 
org.apache.zeppelin.flink.FlinkStreamSqlInterpreter.callInsertInto(FlinkStreamSqlInterpreter.java:97)
at 
org.apache.zeppelin.flink.FlinkSqlInterrpeter.callCommand(FlinkSqlInterrpeter.java:273)
at 
org.apache.zeppelin.flink.FlinkSqlInterrpeter.runSqlList(FlinkSqlInterrpeter.java:160)
at 
org.apache.zeppelin.flink.FlinkSqlInterrpeter.internalInterpret(FlinkSqlInterrpeter.java:112)
at 
org.apache.zeppelin.interpreter.AbstractInterpreter.interpret(AbstractInterpreter.java:47)
at 
org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:110)
at 
org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:852)
at 
org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:744)
at org.apache.zeppelin.scheduler.Job.run(Job.java:172)
at 
org.apache.zeppelin.scheduler.AbstractScheduler.runJob(AbstractScheduler.java:132)
at 
org.apache.zeppelin.scheduler.ParallelScheduler.lambda$runJobInScheduler$0(ParallelScheduler.java:46)
at 
java.base/java.util.concurrent.ThreadPool

[jira] [Updated] (HUDI-3411) Incorrect Record Key Field property Handling

2023-01-03 Thread Jonathan Vexler (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Vexler updated HUDI-3411:
--
Status: Open  (was: In Progress)

> Incorrect Record Key Field property Handling
> 
>
> Key: HUDI-3411
> URL: https://issues.apache.org/jira/browse/HUDI-3411
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Alexey Kudinkin
>Assignee: Jonathan Vexler
>Priority: Blocker
> Fix For: 0.13.0
>
>
> Currently `HoodieTableConfig#getRecordKeyFieldProp` returns a single String, 
> even though it could contain a *list* of columns making up composite Primary 
> Key.
> {code:java}
> public String getRecordKeyFieldProp() {
>   return getStringOrDefault(RECORDKEY_FIELDS, 
> HoodieRecord.RECORD_KEY_METADATA_FIELD);
> } {code}
>  
> Most of the callers of this method are actually not handling this correctly, 
> assuming that the Record Key is always a single field. 
> NOTE: While concatenation of CPK seems like a very natural step here, special 
> care has to be taken, since Composite PK can NOT be concatenated as strings, 
> as this might break the uniqueness constraint. 
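A tiny standalone Python illustration of the uniqueness hazard called out in 
the note, with made-up key values:

```python
# Two distinct composite keys collide once naively concatenated as strings.
key1 = ("a", "bc")
key2 = ("ab", "c")
assert key1 != key2                    # distinct composite keys...
assert "".join(key1) == "".join(key2)  # ...yet both concatenate to "abc"
```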





[jira] [Updated] (HUDI-5023) Add new Executor avoiding Queueing in the write-path

2023-01-03 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-5023:

Status: Patch Available  (was: In Progress)

> Add new Executor avoiding Queueing in the write-path
> 
>
> Key: HUDI-5023
> URL: https://issues.apache.org/jira/browse/HUDI-5023
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> We should evaluate removing _any queueing_ (BoundedInMemoryQueue, 
> DisruptorQueue) on the write path for multiple reasons:
> *It breaks up the vertical chain of transformations applied to data*
> Spark (as well as other engines) relies on the notion of _Iteration_ to 
> vertically compose all transformations applied to a single record, allowing 
> for effective _stream_ processing, where all transformations are applied to 
> an _Iterator, yielding records_ from the source. That way:
>  # The chain of transformations* is applied to every record one by one, 
> effectively limiting the amount of memory used to the number of records 
> being read and processed simultaneously (if the reading is not batched, it'd 
> be just a single record), which in turn allows us
>  # To limit the number of memory allocations required to process a single 
> record. Consider the opposite: if we did it breadth-wise, applying the first 
> transformation to _all_ of the records, we would have to store all of the 
> transformed records in memory, which is costly from both GC overhead and 
> pure object churn perspectives.
>  
> Enqueueing essentially violates both of these invariants, breaking up the 
> {_}stream{_}-like processing model and forcing records to be kept in memory 
> for no good reason.
>  
> * This chain is broken up at shuffling points (the collections of tasks 
> executed b/w these shuffling points are called stages in Spark)
>  
> *It requires data to be allocated on the heap*
> As was called out in the previous paragraph, enqueueing raw data read from 
> the source breaks up _stream_ processing paradigm and forces records to be 
> persisted in the heap.
> Consider following example: plain ParquetReader from Spark actually uses 
> *mutable* `ColumnarBatchRow` providing a Row-based view into the batch of 
> data being read from the file.
> Now, since it's a mutable object we can use it to _iterate_ over all of the 
> records (while doing stream-processing) ultimately producing some "output" 
> (either writing into another file, shuffle block, etc), but we +can't keep a 
> reference on it+ (for ex, by +enqueueing+ it) – since the object is mutable. 
> Instead we are forced to make a *copy* of it, which will obviously require us 
> to allocate it on the heap.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
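
To make the contrast above concrete, a standalone sketch of the two models the description compares (illustrative only, not Hudi's actual BoundedInMemoryQueue/Disruptor executors): the queueing variant must buffer heap copies of every record before consumption, while the direct variant applies the transformation chain to one record at a time as the source iterator yields it.

{code:java}
import java.util.ArrayDeque;
import java.util.Iterator;
import java.util.Queue;
import java.util.function.Consumer;
import java.util.function.Function;

// Illustrative only: contrasts queue-based hand-off with direct iteration.
public final class ExecutorModels {

  // Queueing model: every record is buffered (and therefore must be an
  // immutable heap copy) before any transformation runs, breaking the
  // vertical chain of transformations.
  static <T, R> void queued(Iterator<T> source, Function<T, R> transform,
                            Consumer<R> sink) {
    Queue<T> queue = new ArrayDeque<>();
    source.forEachRemaining(queue::add); // all records held in memory at once
    while (!queue.isEmpty()) {
      sink.accept(transform.apply(queue.poll()));
    }
  }

  // Direct model: each record flows through the whole chain as it is yielded,
  // so memory is bounded by a single in-flight record and a mutable row view
  // can be reused safely.
  static <T, R> void direct(Iterator<T> source, Function<T, R> transform,
                            Consumer<R> sink) {
    while (source.hasNext()) {
      sink.accept(transform.apply(source.next()));
    }
  }
}
{code}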


[jira] [Updated] (HUDI-5023) Add new Executor avoiding Queueing in the write-path

2023-01-03 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-5023:

Status: In Progress  (was: Reopened)

> Add new Executor avoiding Queueing in the write-path
> 
>
> Key: HUDI-5023
> URL: https://issues.apache.org/jira/browse/HUDI-5023
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> We should evaluate removing _any queueing_ (BoundedInMemoryQueue, 
> DisruptorQueue) on the write path, for multiple reasons:
> *It breaks up the vertical chain of transformations applied to data*
> Spark (as do other engines) relies on the notion of _Iteration_ to vertically 
> compose all transformations applied to a single record, allowing for 
> effective _stream_ processing, where all transformations are applied to an 
> _Iterator, yielding records_ from the source. That way:
>  # The chain of transformations* is applied to every record one by one, 
> allowing us to effectively limit the amount of memory used to the number of 
> records being read and processed simultaneously (if the reading is not 
> batched, it'd be just a single record), which in turn allows us
>  # To limit the # of memory allocations required to process a single record. 
> Consider the opposite: if we did it breadth-wise, applying the first 
> transformation to _all_ of the records, we would have to store all of the 
> transformed records in memory, which is costly from both the GC-overhead and 
> object-churn perspectives.
>  
> Enqueueing essentially violates both of these invariants, breaking up the 
> {_}stream{_}-like processing model and forcing records to be kept in memory 
> for no good reason.
>  
> * This chain is broken up at shuffling points (the collections of tasks 
> executed b/w these shuffling points are called stages in Spark)
>  
> *It requires data to be allocated on the heap*
> As was called out in the previous paragraph, enqueueing raw data read from 
> the source breaks up the _stream_ processing paradigm and forces records to 
> be persisted in the heap.
> Consider the following example: the plain ParquetReader from Spark actually 
> uses a *mutable* `ColumnarBatchRow` providing a Row-based view into the batch 
> of data being read from the file.
> Now, since it's a mutable object, we can use it to _iterate_ over all of the 
> records (while doing stream processing), ultimately producing some "output" 
> (either writing into another file, a shuffle block, etc), but we +can't keep 
> a reference on it+ (for ex, by +enqueueing+ it) – since the object is 
> mutable. Instead, we are forced to make a *copy* of it, which will obviously 
> require us to allocate it on the heap.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5023) Add new Executor avoiding Queueing in the write-path

2023-01-03 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-5023:

Sprint: 2022/11/15, 2022/11/29, 0.13.0 Final Sprint  (was: 2022/11/15, 
2022/11/29)

> Add new Executor avoiding Queueing in the write-path
> 
>
> Key: HUDI-5023
> URL: https://issues.apache.org/jira/browse/HUDI-5023
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> We should evaluate removing _any queueing_ (BoundedInMemoryQueue, 
> DisruptorQueue) on the write path, for multiple reasons:
> *It breaks up the vertical chain of transformations applied to data*
> Spark (as do other engines) relies on the notion of _Iteration_ to vertically 
> compose all transformations applied to a single record, allowing for 
> effective _stream_ processing, where all transformations are applied to an 
> _Iterator, yielding records_ from the source. That way:
>  # The chain of transformations* is applied to every record one by one, 
> allowing us to effectively limit the amount of memory used to the number of 
> records being read and processed simultaneously (if the reading is not 
> batched, it'd be just a single record), which in turn allows us
>  # To limit the # of memory allocations required to process a single record. 
> Consider the opposite: if we did it breadth-wise, applying the first 
> transformation to _all_ of the records, we would have to store all of the 
> transformed records in memory, which is costly from both the GC-overhead and 
> object-churn perspectives.
>  
> Enqueueing essentially violates both of these invariants, breaking up the 
> {_}stream{_}-like processing model and forcing records to be kept in memory 
> for no good reason.
>  
> * This chain is broken up at shuffling points (the collections of tasks 
> executed b/w these shuffling points are called stages in Spark)
>  
> *It requires data to be allocated on the heap*
> As was called out in the previous paragraph, enqueueing raw data read from 
> the source breaks up the _stream_ processing paradigm and forces records to 
> be persisted in the heap.
> Consider the following example: the plain ParquetReader from Spark actually 
> uses a *mutable* `ColumnarBatchRow` providing a Row-based view into the batch 
> of data being read from the file.
> Now, since it's a mutable object, we can use it to _iterate_ over all of the 
> records (while doing stream processing), ultimately producing some "output" 
> (either writing into another file, a shuffle block, etc), but we +can't keep 
> a reference on it+ (for ex, by +enqueueing+ it) – since the object is 
> mutable. Instead, we are forced to make a *copy* of it, which will obviously 
> require us to allocate it on the heap.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5023) Add new Executor avoiding Queueing in the write-path

2023-01-03 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-5023:

Reviewers: Ethan Guo, sivabalan narayanan  (was: sivabalan narayanan)

> Add new Executor avoiding Queueing in the write-path
> 
>
> Key: HUDI-5023
> URL: https://issues.apache.org/jira/browse/HUDI-5023
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: Alexey Kudinkin
>Assignee: Yue Zhang
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> We should evaluate removing _any queueing_ (BoundedInMemoryQueue, 
> DisruptorQueue) on the write path, for multiple reasons:
> *It breaks up the vertical chain of transformations applied to data*
> Spark (as do other engines) relies on the notion of _Iteration_ to vertically 
> compose all transformations applied to a single record, allowing for 
> effective _stream_ processing, where all transformations are applied to an 
> _Iterator, yielding records_ from the source. That way:
>  # The chain of transformations* is applied to every record one by one, 
> allowing us to effectively limit the amount of memory used to the number of 
> records being read and processed simultaneously (if the reading is not 
> batched, it'd be just a single record), which in turn allows us
>  # To limit the # of memory allocations required to process a single record. 
> Consider the opposite: if we did it breadth-wise, applying the first 
> transformation to _all_ of the records, we would have to store all of the 
> transformed records in memory, which is costly from both the GC-overhead and 
> object-churn perspectives.
>  
> Enqueueing essentially violates both of these invariants, breaking up the 
> {_}stream{_}-like processing model and forcing records to be kept in memory 
> for no good reason.
>  
> * This chain is broken up at shuffling points (the collections of tasks 
> executed b/w these shuffling points are called stages in Spark)
>  
> *It requires data to be allocated on the heap*
> As was called out in the previous paragraph, enqueueing raw data read from 
> the source breaks up the _stream_ processing paradigm and forces records to 
> be persisted in the heap.
> Consider the following example: the plain ParquetReader from Spark actually 
> uses a *mutable* `ColumnarBatchRow` providing a Row-based view into the batch 
> of data being read from the file.
> Now, since it's a mutable object, we can use it to _iterate_ over all of the 
> records (while doing stream processing), ultimately producing some "output" 
> (either writing into another file, a shuffle block, etc), but we +can't keep 
> a reference on it+ (for ex, by +enqueueing+ it) – since the object is 
> mutable. Instead, we are forced to make a *copy* of it, which will obviously 
> require us to allocate it on the heap.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Reopened] (HUDI-5023) Add new Executor avoiding Queueing in the write-path

2023-01-03 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reopened HUDI-5023:
-
  Assignee: Alexey Kudinkin  (was: Yue Zhang)

> Add new Executor avoiding Queueing in the write-path
> 
>
> Key: HUDI-5023
> URL: https://issues.apache.org/jira/browse/HUDI-5023
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> We should evaluate removing _any queueing_ (BoundedInMemoryQueue, 
> DisruptorQueue) on the write path, for multiple reasons:
> *It breaks up the vertical chain of transformations applied to data*
> Spark (as do other engines) relies on the notion of _Iteration_ to vertically 
> compose all transformations applied to a single record, allowing for 
> effective _stream_ processing, where all transformations are applied to an 
> _Iterator, yielding records_ from the source. That way:
>  # The chain of transformations* is applied to every record one by one, 
> allowing us to effectively limit the amount of memory used to the number of 
> records being read and processed simultaneously (if the reading is not 
> batched, it'd be just a single record), which in turn allows us
>  # To limit the # of memory allocations required to process a single record. 
> Consider the opposite: if we did it breadth-wise, applying the first 
> transformation to _all_ of the records, we would have to store all of the 
> transformed records in memory, which is costly from both the GC-overhead and 
> object-churn perspectives.
>  
> Enqueueing essentially violates both of these invariants, breaking up the 
> {_}stream{_}-like processing model and forcing records to be kept in memory 
> for no good reason.
>  
> * This chain is broken up at shuffling points (the collections of tasks 
> executed b/w these shuffling points are called stages in Spark)
>  
> *It requires data to be allocated on the heap*
> As was called out in the previous paragraph, enqueueing raw data read from 
> the source breaks up the _stream_ processing paradigm and forces records to 
> be persisted in the heap.
> Consider the following example: the plain ParquetReader from Spark actually 
> uses a *mutable* `ColumnarBatchRow` providing a Row-based view into the batch 
> of data being read from the file.
> Now, since it's a mutable object, we can use it to _iterate_ over all of the 
> records (while doing stream processing), ultimately producing some "output" 
> (either writing into another file, a shuffle block, etc), but we +can't keep 
> a reference on it+ (for ex, by +enqueueing+ it) – since the object is 
> mutable. Instead, we are forced to make a *copy* of it, which will obviously 
> require us to allocate it on the heap.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] yihua commented on a diff in pull request #7476: [HUDI-5023] Switching default Write Executor type to `SIMPLE`

2023-01-03 Thread GitBox


yihua commented on code in PR #7476:
URL: https://github.com/apache/hudi/pull/7476#discussion_r1060867754


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##
@@ -2449,12 +2450,17 @@ public Builder withWriteBufferLimitBytes(int 
writeBufferLimit) {
 }
 
 public Builder withWriteWaitStrategy(String waitStrategy) {
-  writeConfig.setValue(WRITE_WAIT_STRATEGY, String.valueOf(waitStrategy));
+  writeConfig.setValue(WRITE_EXECUTOR_DISRUPTOR_WAIT_STRATEGY, 
String.valueOf(waitStrategy));
   return this;
 }
 
 public Builder withWriteBufferSize(int size) {
-  writeConfig.setValue(WRITE_DISRUPTOR_BUFFER_SIZE, String.valueOf(size));
+  writeConfig.setValue(WRITE_EXECUTOR_DISRUPTOR_BUFFER_SIZE, 
String.valueOf(size));
+  return this;
+}

Review Comment:
   nit: If this is not intended to be used as a public API, could you remove 
this and only keep `Builder#withWriteExecutorDisruptorWriteBufferSize`, which 
has the same functionality?



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##
@@ -2857,8 +2863,15 @@ private void validate() {
 }
 
 public HoodieWriteConfig build() {
+  return build(true);
+}
+
+@VisibleForTesting
+public HoodieWriteConfig build(boolean shouldValidate) {

Review Comment:
   should this method use the default scope/visibility instead of `public`?



##
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/execution/TestSimpleExecutionInSpark.java:
##
@@ -88,10 +81,10 @@ public Integer finish() {
 return count;
   }
 };
-SimpleHoodieExecutor>, Integer> exec = null;
+SimpleExecutor>, 
Integer> exec = null;
 
 try {
-  exec = new SimpleHoodieExecutor(hoodieRecords.iterator(), consumer, 
getCloningTransformer(HoodieTestDataGenerator.AVRO_SCHEMA));
+  exec = new SimpleExecutor(hoodieRecords.iterator(), consumer, 
Function.identity());

Review Comment:
   Still use `getTransformer` instead of `Function.identity()` for the tests?



##
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/execution/TestDisruptorExecutionInSpark.java:
##
@@ -35,24 +36,27 @@
 import org.junit.jupiter.api.BeforeEach;
 import org.junit.jupiter.api.Test;
 import org.junit.jupiter.api.Timeout;
+import scala.Tuple2;
 
 import java.util.ArrayList;
 import java.util.List;
 
-import scala.Tuple2;
-
-import static 
org.apache.hudi.execution.HoodieLazyInsertIterable.getCloningTransformer;
+import static 
org.apache.hudi.execution.HoodieLazyInsertIterable.getTransformer;
 import static org.junit.jupiter.api.Assertions.assertEquals;
 import static org.junit.jupiter.api.Assertions.assertFalse;
 import static org.junit.jupiter.api.Assertions.assertThrows;
 import static org.junit.jupiter.api.Assertions.assertTrue;
-import static org.mockito.Mockito.mock;
-import static org.mockito.Mockito.when;
 
 public class TestDisruptorExecutionInSpark extends HoodieClientTestHarness {
 
   private final String instantTime = 
HoodieActiveTimeline.createNewInstantTime();
 
+
+  private final HoodieWriteConfig writeConfig = HoodieWriteConfig.newBuilder()

Review Comment:
   nit: (not required in this PR) the tests using different types of write 
executors should be generalized in a base class, like `TestWriteMarkersBase`, 
instead of duplicating the code.  This can be refactored later on.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
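
On the first nit above, one conventional resolution that avoids breaking existing callers is deprecate-and-delegate. A self-contained sketch of the pattern (`ConfigBuilder` is hypothetical, not Hudi's actual `Builder`):

```
// Illustrative only: a mini-builder showing the deprecate-and-delegate pattern.
public final class ConfigBuilder {
  private int bufferSize;

  public ConfigBuilder withWriteExecutorDisruptorWriteBufferSize(int size) {
    this.bufferSize = size; // single source of truth
    return this;
  }

  /** @deprecated use {@link #withWriteExecutorDisruptorWriteBufferSize(int)}. */
  @Deprecated
  public ConfigBuilder withWriteBufferSize(int size) {
    return withWriteExecutorDisruptorWriteBufferSize(size);
  }
}
```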



[GitHub] [hudi] hudi-bot commented on pull request #7572: [HUDI-5483]Make retryhelper more suitable for common use.

2023-01-03 Thread GitBox


hudi-bot commented on PR #7572:
URL: https://github.com/apache/hudi/pull/7572#issuecomment-1370124079

   
   ## CI report:
   
   * 96c0d86652aa342ad9b13bccc30020830ef8d204 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14084)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7573: [HUDI-5484] Avoid using GenericRecord in ColumnStatMetadata

2023-01-03 Thread GitBox


hudi-bot commented on PR #7573:
URL: https://github.com/apache/hudi/pull/7573#issuecomment-1370118321

   
   ## CI report:
   
   * 1ac267ba9af690ecd47f74f60c34851387aee9eb Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14080)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14083)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] jonvex commented on issue #5687: [SUPPORT]hudi sql parser ignores all exceptions of spark sql parser

2023-01-03 Thread GitBox


jonvex commented on issue #5687:
URL: https://github.com/apache/hudi/issues/5687#issuecomment-1370096424

   I was not able to reproduce the error by running 
   ```
   select CAST(-123456789 AS TIMESTAMP) as de;
   ```
   in spark-sql, with or without Hudi.
   I was able to confirm that I did have `spark.sql.ansi.enabled = true` set 
because running 
   ```
   SELECT CAST('a' AS INT) as DE;
   ```
   produced an exception that was not produced when `spark.sql.ansi.enabled = 
false`.
   Additionally, the exception produced from Hudi started with 
   ```
   org.apache.spark.SparkNumberFormatException: The value 'a' of the type 
"STRING" cannot be cast to "INT" because it is malformed. Correct the value as 
per the syntax, or change its target type. Use `try_cast` to tolerate malformed 
input and return NULL instead. If necessary set "spark.sql.ansi.enabled" to 
"false" to bypass this error.
   == SQL(line 1, position 8) ==
   SELECT CAST('a' AS INT) as DE
  
   ```
   which seems to be the Spark exception information.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
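
A minimal, self-contained Java sketch of the reproduction described above (assumes Spark 3.x on the classpath; the behavior follows Spark's documented ANSI-mode cast semantics):

```
import org.apache.spark.sql.SparkSession;

// Sketch only: reproduces the ANSI-mode cast behavior discussed above.
public class AnsiCastRepro {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .master("local[*]").appName("ansi-cast-repro").getOrCreate();

    spark.conf().set("spark.sql.ansi.enabled", "true");
    try {
      // Under ANSI mode this throws (SparkNumberFormatException in Spark 3.3+).
      spark.sql("SELECT CAST('a' AS INT) AS de").show();
    } catch (Exception e) {
      System.out.println("ANSI mode rejected the cast: " + e.getMessage());
    }

    spark.conf().set("spark.sql.ansi.enabled", "false");
    spark.sql("SELECT CAST('a' AS INT) AS de").show(); // prints NULL when ANSI is off
    spark.stop();
  }
}
```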



[GitHub] [hudi] rahil-c commented on pull request #7584: [HUDI-5205] Support Flink 1.16.0

2023-01-03 Thread GitBox


rahil-c commented on PR #7584:
URL: https://github.com/apache/hudi/pull/7584#issuecomment-1370087863

   @stayrascal @danny0405 Thanks for making this change. Just to confirm, is 
this targeted for Hudi 0.13.0? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-5442) Fix HiveHoodieTableFileIndex to use lazy listing

2023-01-03 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-5442:

Status: In Progress  (was: Open)

> Fix HiveHoodieTableFileIndex to use lazy listing
> 
>
> Key: HUDI-5442
> URL: https://issues.apache.org/jira/browse/HUDI-5442
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core, trino-presto
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.13.0
>
>
> Currently, HiveHoodieTableFileIndex hard-codes the shouldListLazily to false, 
> using eager listing only.  This leads to scanning all table partitions in the 
> file index, regardless of the queryPaths provided (for Trino Hive connector, 
> only one partition is passed in).
> {code:java}
> public HiveHoodieTableFileIndex(HoodieEngineContext engineContext,
> HoodieTableMetaClient metaClient,
> TypedProperties configProperties,
> HoodieTableQueryType queryType,
> List queryPaths,
> Option specifiedQueryInstant,
> boolean shouldIncludePendingCommits
> ) {
>   super(engineContext,
>   metaClient,
>   configProperties,
>   queryType,
>   queryPaths,
>   specifiedQueryInstant,
>   shouldIncludePendingCommits,
>   true,
>   new NoopCache(),
>   false);
> } {code}
> After flipping it to true for testing, the following exception is thrown.
> {code:java}
> io.trino.spi.TrinoException: Failed to parse partition column values from the 
> partition-path: likely non-encoded slashes being used in partition column's 
> values. You can try to work this around by switching listing mode to eager
>     at 
> io.trino.plugin.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:284)
>     at io.trino.plugin.hive.util.ResumableTasks$1.run(ResumableTasks.java:38)
>     at io.trino.$gen.Trino_39220221217_092723_2.run(Unknown Source)
>     at 
> io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:80)
>     at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>     at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>     at java.base/java.lang.Thread.run(Thread.java:833)
> Caused by: org.apache.hudi.exception.HoodieException: Failed to parse 
> partition column values from the partition-path: likely non-encoded slashes 
> being used in partition column's values. You can try to work this around by 
> switching listing mode to eager
>     at 
> org.apache.hudi.BaseHoodieTableFileIndex.parsePartitionColumnValues(BaseHoodieTableFileIndex.java:317)
>     at 
> org.apache.hudi.BaseHoodieTableFileIndex.lambda$listPartitionPaths$6(BaseHoodieTableFileIndex.java:288)
>     at 
> java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
>     at 
> java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625)
>     at 
> java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
>     at 
> java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
>     at 
> java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921)
>     at 
> java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>     at 
> java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682)
>     at 
> org.apache.hudi.BaseHoodieTableFileIndex.listPartitionPaths(BaseHoodieTableFileIndex.java:291)
>     at 
> org.apache.hudi.BaseHoodieTableFileIndex.getAllQueryPartitionPaths(BaseHoodieTableFileIndex.java:205)
>     at 
> org.apache.hudi.BaseHoodieTableFileIndex.getAllInputFileSlices(BaseHoodieTableFileIndex.java:216)
>     at 
> org.apache.hudi.hadoop.HiveHoodieTableFileIndex.listFileSlices(HiveHoodieTableFileIndex.java:71)
>     at 
> org.apache.hudi.hadoop.HoodieCopyOnWriteTableInputFormat.listStatusForSnapshotMode(HoodieCopyOnWriteTableInputFormat.java:263)
>     at 
> org.apache.hudi.hadoop.HoodieCopyOnWriteTableInputFormat.listStatus(HoodieCopyOnWriteTableInputFormat.java:158)
>     at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:325)
>     at 
> org.apache.hudi.hadoop.HoodieParquetInputFormatBase.getSplits(HoodieParquetInputFormatBase.java:68)
>     at 
> io.trino.plugin.hive.BackgroundHiveSplitLoader.lambda$loadPartition$2(BackgroundHiveSplitLoader.java:493)
>     at 
> io.trino.plugin.hive.authentication.NoHdfsAuthentication.doAs(NoHdfsAuthentication.java:25)
>     at io.trino.plugin.hive.HdfsEnvironment.doAs(HdfsEnvironment.java:97)
>     at

[jira] [Updated] (HUDI-5442) Fix HiveHoodieTableFileIndex to use lazy listing

2023-01-03 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-5442:

Status: Open  (was: Patch Available)

> Fix HiveHoodieTableFileIndex to use lazy listing
> 
>
> Key: HUDI-5442
> URL: https://issues.apache.org/jira/browse/HUDI-5442
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core, trino-presto
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.13.0
>
>
> Currently, HiveHoodieTableFileIndex hard-codes the shouldListLazily to false, 
> using eager listing only.  This leads to scanning all table partitions in the 
> file index, regardless of the queryPaths provided (for Trino Hive connector, 
> only one partition is passed in).
> {code:java}
> public HiveHoodieTableFileIndex(HoodieEngineContext engineContext,
> HoodieTableMetaClient metaClient,
> TypedProperties configProperties,
> HoodieTableQueryType queryType,
> List queryPaths,
> Option specifiedQueryInstant,
> boolean shouldIncludePendingCommits
> ) {
>   super(engineContext,
>   metaClient,
>   configProperties,
>   queryType,
>   queryPaths,
>   specifiedQueryInstant,
>   shouldIncludePendingCommits,
>   true,
>   new NoopCache(),
>   false);
> } {code}
> After flipping it to true for testing, the following exception is thrown.
> {code:java}
> io.trino.spi.TrinoException: Failed to parse partition column values from the 
> partition-path: likely non-encoded slashes being used in partition column's 
> values. You can try to work this around by switching listing mode to eager
>     at 
> io.trino.plugin.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:284)
>     at io.trino.plugin.hive.util.ResumableTasks$1.run(ResumableTasks.java:38)
>     at io.trino.$gen.Trino_39220221217_092723_2.run(Unknown Source)
>     at 
> io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:80)
>     at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>     at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>     at java.base/java.lang.Thread.run(Thread.java:833)
> Caused by: org.apache.hudi.exception.HoodieException: Failed to parse 
> partition column values from the partition-path: likely non-encoded slashes 
> being used in partition column's values. You can try to work this around by 
> switching listing mode to eager
>     at 
> org.apache.hudi.BaseHoodieTableFileIndex.parsePartitionColumnValues(BaseHoodieTableFileIndex.java:317)
>     at 
> org.apache.hudi.BaseHoodieTableFileIndex.lambda$listPartitionPaths$6(BaseHoodieTableFileIndex.java:288)
>     at 
> java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
>     at 
> java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625)
>     at 
> java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
>     at 
> java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
>     at 
> java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921)
>     at 
> java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>     at 
> java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682)
>     at 
> org.apache.hudi.BaseHoodieTableFileIndex.listPartitionPaths(BaseHoodieTableFileIndex.java:291)
>     at 
> org.apache.hudi.BaseHoodieTableFileIndex.getAllQueryPartitionPaths(BaseHoodieTableFileIndex.java:205)
>     at 
> org.apache.hudi.BaseHoodieTableFileIndex.getAllInputFileSlices(BaseHoodieTableFileIndex.java:216)
>     at 
> org.apache.hudi.hadoop.HiveHoodieTableFileIndex.listFileSlices(HiveHoodieTableFileIndex.java:71)
>     at 
> org.apache.hudi.hadoop.HoodieCopyOnWriteTableInputFormat.listStatusForSnapshotMode(HoodieCopyOnWriteTableInputFormat.java:263)
>     at 
> org.apache.hudi.hadoop.HoodieCopyOnWriteTableInputFormat.listStatus(HoodieCopyOnWriteTableInputFormat.java:158)
>     at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:325)
>     at 
> org.apache.hudi.hadoop.HoodieParquetInputFormatBase.getSplits(HoodieParquetInputFormatBase.java:68)
>     at 
> io.trino.plugin.hive.BackgroundHiveSplitLoader.lambda$loadPartition$2(BackgroundHiveSplitLoader.java:493)
>     at 
> io.trino.plugin.hive.authentication.NoHdfsAuthentication.doAs(NoHdfsAuthentication.java:25)
>     at io.trino.plugin.hive.HdfsEnvironment.doAs(HdfsEnvironment.java:97)
>   
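
The fix the ticket points toward is making the hard-coded flag configurable rather than always eager. A minimal sketch, assuming a hypothetical config key (`hoodie.fileindex.listing.mode.lazy` does not exist in Hudi; it is introduced here only for illustration):

{code:java}
import java.util.Properties;

// Illustrative only: derive the listing mode from configuration instead of
// hard-coding it. The key name below is hypothetical.
public final class ListingModeSketch {
  static final String LAZY_LISTING_KEY = "hoodie.fileindex.listing.mode.lazy";

  static boolean shouldListLazily(Properties configProperties) {
    // Default stays 'false' (eager) to preserve current behavior until the
    // lazy path handles non-encoded slashes in partition column values.
    return Boolean.parseBoolean(
        configProperties.getProperty(LAZY_LISTING_KEY, "false"));
  }
}
{code}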

[jira] [Updated] (HUDI-5485) Improve performance of savepoint with MDT

2023-01-03 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-5485:

Status: In Progress  (was: Open)

> Improve performance of savepoint with MDT
> -
>
> Key: HUDI-5485
> URL: https://issues.apache.org/jira/browse/HUDI-5485
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Critical
> Fix For: 0.13.0
>
>
> [https://github.com/apache/hudi/issues/7541]
> When the metadata table is enabled, the savepoint operation is slow for a 
> large number of partitions (e.g., 75k).  The root cause is that the metadata 
> table is scanned once per partition, which is unnecessary.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
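
A hedged sketch of the direction such a fix could take: one batched metadata-table scan covering all partitions instead of a scan per partition. Method names mirror Hudi's `HoodieTableMetadata`, but the exact signatures used here are assumptions:

{code:java}
import java.util.List;
import java.util.Map;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hudi.metadata.HoodieTableMetadata;

// Sketch only: one batched scan instead of ~75k per-partition scans.
public final class SavepointListingSketch {
  static void collectFilesForSavepoint(HoodieTableMetadata metadata) throws Exception {
    List<String> partitions = metadata.getAllPartitionPaths();
    // Single call covering every partition, rather than a scan per partition.
    Map<String, FileStatus[]> filesByPartition =
        metadata.getAllFilesInPartitions(partitions);
    filesByPartition.forEach((partition, files) -> {
      // record the latest file slices for this partition in the savepoint metadata
    });
  }
}
{code}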


[GitHub] [hudi] jonvex commented on issue #7494: FileNotFoundException while writing dataframe to local file system

2023-01-03 Thread GitBox


jonvex commented on issue #7494:
URL: https://github.com/apache/hudi/issues/7494#issuecomment-1370067142

   Here are the steps that I tried:
   
   1. Download 
[spark-3.3.1-bin-hadoop3.tgz](https://archive.apache.org/dist/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz)
   2. Set the environment variables 
   ```
   export SPARK_HOME=/Users/jon/Documents/spark-3.3.1-bin-hadoop3
   export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
   export PYSPARK_SUBMIT_ARGS="--master local[*]"
   export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
   export PYTHONPATH=$SPARK_HOME/python/lib/*.zip:$PYTHONPATH
   export PYSPARK_PYTHON=$(which python3)
   ```
   3. Ran the command to start pyspark: 
   ```
   $SPARK_HOME/bin/pyspark \
   --packages org.apache.hudi:hudi-spark3.3-bundle_2.12:0.12.2 \
   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
   --conf 
'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
 \
   --conf 
'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
   ```
   4. Then in pyspark I ran the following:
   ```
   tableName = "hudi_trips_cow"
   basePath = "file:///tmp/hudi_trips_cow"
   dataGen = sc._jvm.org.apache.hudi.QuickstartUtils.DataGenerator()
   inserts = 
sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateInserts(10))
   df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
   
   hudi_options = {
   'hoodie.table.name': tableName,
   'hoodie.datasource.write.recordkey.field': 'uuid',
   'hoodie.datasource.write.partitionpath.field': 'partitionpath',
   'hoodie.datasource.write.table.name': tableName,
   'hoodie.datasource.write.operation': 'upsert',
   'hoodie.datasource.write.precombine.field': 'ts',
   'hoodie.upsert.shuffle.parallelism': 2,
   'hoodie.insert.shuffle.parallelism': 2
   }
   
   df.write.format("hudi"). \
   options(**hudi_options). \
   mode("overwrite"). \
   save(basePath)
   ```
   Following those steps, I was unable to reproduce the issue presented.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xushiyan commented on pull request #7572: [HUDI-5483]Make retryhelper more suitable for common use.

2023-01-03 Thread GitBox


xushiyan commented on PR #7572:
URL: https://github.com/apache/hudi/pull/7572#issuecomment-1370045864

   Can you also edit the PR description as per the template? It's failing the 
validate-pr check.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xushiyan commented on a diff in pull request #7572: [HUDI-5483]Make retryhelper more suitable for common use.

2023-01-03 Thread GitBox


xushiyan commented on code in PR #7572:
URL: https://github.com/apache/hudi/pull/7572#discussion_r1060804456


##
hudi-common/src/main/java/org/apache/hudi/common/util/RetryHelper.java:
##
@@ -69,12 +69,12 @@ public RetryHelper(long maxRetryIntervalMs, int 
maxRetryNumbers, long initialRet
 this.taskInfo = taskInfo;
   }
 
-  public RetryHelper tryWith(CheckedFunction func) {
+  public RetryHelper tryWith(CheckedFunction func) {
 this.func = func;
 return this;
   }
 
-  public T start(CheckedFunction func) throws IOException {
+  public  T start(CheckedFunction func) throws R {

Review Comment:
   /nit: `<R>` here is redundant. `CheckedFunction` already 
declares `R`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
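
To make the nit concrete, a self-contained sketch (illustrative only, not Hudi's actual `RetryHelper`): once the class declares the exception type parameter, re-declaring it on `start` adds nothing.

```
// Illustrative only: the class-level R is already in scope for start(),
// so a method-level <R> re-declaration would shadow it rather than help.
interface CheckedFunction<T, R extends Exception> {
  T apply() throws R;
}

final class Retry<T, R extends Exception> {
  private CheckedFunction<T, R> func;

  Retry<T, R> tryWith(CheckedFunction<T, R> f) {
    this.func = f;
    return this;
  }

  T start() throws R {   // no extra <R> needed here
    return func.apply();
  }
}
```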



[GitHub] [hudi] xushiyan commented on a diff in pull request #6732: [HUDI-4148] Add client for hudi table service manager

2023-01-03 Thread GitBox


xushiyan commented on code in PR #6732:
URL: https://github.com/apache/hudi/pull/6732#discussion_r1060687626


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieClient.java:
##
@@ -85,6 +91,8 @@ protected BaseHoodieClient(HoodieEngineContext context, 
HoodieWriteConfig client
   public void close() {
 stopEmbeddedServerView(true);
 this.context.setJobStatus("", "");
+this.heartbeatClient.stop();
+this.txnManager.close();

Review Comment:
   remark: this is common to all clients, so moved here.



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java:
##
@@ -0,0 +1,840 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.client;
+
+import org.apache.hudi.async.AsyncArchiveService;
+import org.apache.hudi.async.AsyncCleanerService;
+import org.apache.hudi.avro.model.HoodieCleanMetadata;
+import org.apache.hudi.avro.model.HoodieCleanerPlan;
+import org.apache.hudi.avro.model.HoodieClusteringPlan;
+import org.apache.hudi.avro.model.HoodieCompactionPlan;
+import org.apache.hudi.avro.model.HoodieRollbackMetadata;
+import org.apache.hudi.avro.model.HoodieRollbackPlan;
+import org.apache.hudi.client.heartbeat.HeartbeatUtils;
+import org.apache.hudi.common.HoodiePendingRollbackInfo;
+import org.apache.hudi.common.engine.HoodieEngineContext;
+import org.apache.hudi.common.model.ActionType;
+import org.apache.hudi.common.model.HoodieCommitMetadata;
+import org.apache.hudi.common.model.HoodieFailedWritesCleaningPolicy;
+import org.apache.hudi.common.model.HoodieWriteStat;
+import org.apache.hudi.common.model.TableServiceType;
+import org.apache.hudi.common.model.WriteOperationType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.CleanerUtils;
+import org.apache.hudi.common.util.ClusteringUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.config.HoodieClusteringConfig;
+import org.apache.hudi.config.HoodieCompactionConfig;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieCommitException;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.exception.HoodieRollbackException;
+import org.apache.hudi.metadata.HoodieTableMetadataWriter;
+import org.apache.hudi.table.HoodieTable;
+import org.apache.hudi.table.action.HoodieWriteMetadata;
+import org.apache.hudi.table.action.rollback.RollbackUtils;
+
+import com.codahale.metrics.Timer;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+import javax.annotation.Nullable;
+
+import java.io.IOException;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.LinkedHashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+public abstract class BaseHoodieTableServiceClient extends BaseHoodieClient 
implements RunsTableService {

Review Comment:
   remark: logic for table services like clean, cluster, compact, archive, and 
rollback was moved here from the base write client



##
hudi-client/hudi-java-client/src/main/java/org/apache/hudi/client/HoodieJavaTableServiceClient.java:
##
@@ -0,0 +1,65 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the L

[GitHub] [hudi] jonvex commented on issue #7494: FileNotFoundException while writing dataframe to local file system

2023-01-03 Thread GitBox


jonvex commented on issue #7494:
URL: https://github.com/apache/hudi/issues/7494#issuecomment-1369997447

   @idatya I'm still looking into this, but it would be helpful to know whether 
you are using one of the Hudi release branches or master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #7574: [SUPPORT] Upsert job failing while upgrading from 0.7 to 0.10.1

2023-01-03 Thread GitBox


nsivabalan commented on issue #7574:
URL: https://github.com/apache/hudi/issues/7574#issuecomment-1369967197

   @amitbans : 
   can you paste the write configs you are using?
   Also, a screenshot of the Jobs and Stages pages from the Spark UI would 
help. For the particular job and stage that's failing, if you can click on 
"+details" and show us the stacktrace, that would be nice as well (because 
sometimes the job/stage description might not match the exact code) 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7572: [HUDI-5483]Make retryhelper more suitable for common use.

2023-01-03 Thread GitBox


hudi-bot commented on PR #7572:
URL: https://github.com/apache/hudi/pull/7572#issuecomment-1369961264

   
   ## CI report:
   
   * f6a856197ee645a4dcb155a7616c6363829c5d37 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14060)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14081)
 
   * 96c0d86652aa342ad9b13bccc30020830ef8d204 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14084)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7572: [HUDI-5483]Make retryhelper more suitable for common use.

2023-01-03 Thread GitBox


hudi-bot commented on PR #7572:
URL: https://github.com/apache/hudi/pull/7572#issuecomment-1369953527

   
   ## CI report:
   
   * f6a856197ee645a4dcb155a7616c6363829c5d37 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14060)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14081)
 
   * 96c0d86652aa342ad9b13bccc30020830ef8d204 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7573: [HUDI-5484] Avoid using GenericRecord in ColumnStatMetadata

2023-01-03 Thread GitBox


hudi-bot commented on PR #7573:
URL: https://github.com/apache/hudi/pull/7573#issuecomment-1369945551

   
   ## CI report:
   
   * 1ac267ba9af690ecd47f74f60c34851387aee9eb Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14080)
 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14083)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7572: [HUDI-5483]Make retryhelper more suitable for common use.

2023-01-03 Thread GitBox


hudi-bot commented on PR #7572:
URL: https://github.com/apache/hudi/pull/7572#issuecomment-1369945409

   
   ## CI report:
   
   * f6a856197ee645a4dcb155a7616c6363829c5d37 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14060)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14081)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] afuyo opened a new issue, #7596: [SUPPORT] java.lang.NoSuchMethodException: org.apache.hudi.utilities.sources.AvroKafkaSource when running HoodieDeltaStreamer

2023-01-03 Thread GitBox


afuyo opened a new issue, #7596:
URL: https://github.com/apache/hudi/issues/7596

   - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
   - Yes
   **Describe the problem you faced**
   Exception when running HoodieDeltaStreamer: Could not load class 
org.apache.hudi.utilities.sources.AvroKafkaSource
   I want to use the streaming ingestion feature via DeltaStreamer, but I run 
into `java.lang.NoSuchMethodException: 
org.apache.hudi.utilities.sources.AvroKafkaSource.<init>(org.apache.hudi.common.config.TypedProperties,
 org.apache.spark.api.java.JavaSparkContext, org.apache.spark.sql.SparkSession, 
org.apache.hudi.utilities.schema.SchemaProvider)`
   
   It kind of looks like a version mismatch, and I might be missing some 
obvious configuration. :) 
   
   **To Reproduce**
   ```
   spark-submit --jars /opt/spark/hudi-spark3.1-bundle_2.12-0.12.1.jar \
   --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer /opt/spark/hudi-utilities-bundle_2.12-0.12.1.jar \
   --props /opt/spark/kafka-source.properties \
   --schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider \
   --source-class org.apache.hudi.utilities.sources.AvroKafkaSource \
   --source-ordering-field f1 \
   --target-base-path /home/azureuser/hudi-t1a \
   --target-table hudi-t1a \
   --op INSERT \
   --filter-dupes \
   --table-type COPY_ON_WRITE \
   --continuous
   ```
   
   
   **Expected behavior**
   Streaming ingestion
   **Environment Description**
   
   * Hudi version : 0.10, 0.11, 0.12
   
   * Spark version :3.1.3
   
   * Hive version :
   
   * Hadoop version : 3.2
   
   * Storage (HDFS/S3/GCS..) : local
   
   * Running on Docker? (yes/no) : No
   
   
   **Additional context**
   I tried to run it on Azure Spark pool but facing the same errors when 
running in my local machine. 
   
   **Stacktrace**
   
   ```
spark-submit  --class 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer 
/opt/spark/hudi-utilities-bundle_2.11-0.10.1.jar --props 
/opt/spark/kafka-source.properties --schemaprovider-class 
org.apache.hudi.utilities.schema.SchemaRegistryProvider --source-class 
org.apache.hudi.utilities.sources.AvroKafkaSource --source-ordering-field f1 
--target-base-path /opt/spark/hudi-t1a --target-table hudi-t1a --op INSERT 
--filter-dupes --table-type COPY_ON_WRITE --continuous
   WARNING: An illegal reflective access operation has occurred
   WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform 
(file:/opt/spark/jars/spark-unsafe_2.12-3.1.3.jar) to constructor 
java.nio.DirectByteBuffer(long,int)
   WARNING: Please consider reporting this to the maintainers of 
org.apache.spark.unsafe.Platform
   WARNING: Use --illegal-access=warn to enable warnings of further illegal 
reflective access operations
   WARNING: All illegal access operations will be denied in a future release
   23/01/03 10:11:50 WARN NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
   log4j:WARN No appenders could be found for logger 
(org.apache.hudi.utilities.deltastreamer.SchedulerConfGenerator).
   log4j:WARN Please initialize the log4j system properly.
   log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for 
more info.
   Using Spark's default log4j profile: 
org/apache/spark/log4j-defaults.properties
   23/01/03 10:11:51 INFO SparkContext: Running Spark version 3.1.3
   23/01/03 10:11:51 INFO ResourceUtils: 
==
   23/01/03 10:11:51 INFO ResourceUtils: No custom resources configured for 
spark.driver.
   23/01/03 10:11:51 INFO ResourceUtils: 
==
   23/01/03 10:11:51 INFO SparkContext: Submitted application: 
delta-streamer-hudi-t1a
   23/01/03 10:11:51 INFO ResourceProfile: Default ResourceProfile created, 
executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , 
memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: 
offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: 
cpus, amount: 1.0)
   23/01/03 10:11:51 INFO ResourceProfile: Limiting resource is cpu
   23/01/03 10:11:51 INFO ResourceProfileManager: Added ResourceProfile id: 0
   23/01/03 10:11:51 INFO SecurityManager: Changing view acls to: azureuser
   23/01/03 10:11:51 INFO SecurityManager: Changing modify acls to: azureuser
   23/01/03 10:11:51 INFO SecurityManager: Changing view acls groups to:
   23/01/03 10:11:51 INFO SecurityManager: Changing modify acls groups to:
   23/01/03 10:11:51 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users  with view permissions: Set(azureuser); 
groups with view permissions: Set(); users  with modify permissions: 
Set(azureuser); groups with modify permissions: Set()
   23/01/03 10:11:51 INF

[GitHub] [hudi] hudi-bot commented on pull request #7584: [HUDI-5205] Support Flink 1.16.0

2023-01-03 Thread GitBox


hudi-bot commented on PR #7584:
URL: https://github.com/apache/hudi/pull/7584#issuecomment-1369862519

   
   ## CI report:
   
   * efd000d200790d748acb49ad79cd2ff09db64d73 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14082)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7584: [HUDI-5205] Support Flink 1.16.0

2023-01-03 Thread GitBox


hudi-bot commented on PR #7584:
URL: https://github.com/apache/hudi/pull/7584#issuecomment-1369855337

   
   ## CI report:
   
   * 1e9fa1cc5c993845a338616532ebc189772a181c Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14051)
 
   * efd000d200790d748acb49ad79cd2ff09db64d73 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14082)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7584: [HUDI-5205] Support Flink 1.16.0

2023-01-03 Thread GitBox


hudi-bot commented on PR #7584:
URL: https://github.com/apache/hudi/pull/7584#issuecomment-1369848321

   
   ## CI report:
   
   * 1e9fa1cc5c993845a338616532ebc189772a181c Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14051)
 
   * efd000d200790d748acb49ad79cd2ff09db64d73 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7593: [HUDI-5492] spark call command show_compaction doesn't return the com…

2023-01-03 Thread GitBox


hudi-bot commented on PR #7593:
URL: https://github.com/apache/hudi/pull/7593#issuecomment-1369841163

   
   ## CI report:
   
   * 8dac276274844f65a48d2e877a3cb1ed1d4ec3e3 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14079)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7573: [HUDI-5484] Avoid using GenericRecord in ColumnStatMetadata

2023-01-03 Thread GitBox


hudi-bot commented on PR #7573:
URL: https://github.com/apache/hudi/pull/7573#issuecomment-1369841004

   
   ## CI report:
   
   * 1ac267ba9af690ecd47f74f60c34851387aee9eb Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14080)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7572: [HUDI-5483]Make retryhelper more suitable for common use.

2023-01-03 Thread GitBox


hudi-bot commented on PR #7572:
URL: https://github.com/apache/hudi/pull/7572#issuecomment-1369748183

   
   ## CI report:
   
   * f6a856197ee645a4dcb155a7616c6363829c5d37 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14060)
 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14081)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] tooptoop4 commented on issue #7594: [SUPPORT] Hudi Time Travel from Athena

2023-01-03 Thread GitBox


tooptoop4 commented on issue #7594:
URL: https://github.com/apache/hudi/issues/7594#issuecomment-1369727346

   https://github.com/trinodb/trino/pull/15084 hasn't been merged


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] BalaMahesh opened a new issue, #7595: [SUPPORT] Hudi Clean and Delta commits taking ~50 mins to finish frequently

2023-01-03 Thread GitBox


BalaMahesh opened a new issue, #7595:
URL: https://github.com/apache/hudi/issues/7595

   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
   
   - Join the mailing list to engage in conversations and get faster support at 
dev-subscr...@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an 
[issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   
   We have a Hudi table with the metadata table enabled, written by Deltastreamer with async clean and async compaction services. Delta commit and clean operations frequently take ~50 minutes.
   
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Run Hudi version 0.12.1 with the metadata table enabled.
   2. Enable async compaction and cleaner services.
   3. Use the configuration below (see the properties sketch after this list):
   hoodie.cleaner.policy=KEEP_LATEST_COMMITS
   hoodie.clean.automatic=true
   hoodie.clean.async=true
   hoodie.cleaner.commits.retained=5
   hoodie.keep.min.commits=10
   #compaction config
   hoodie.datasource.compaction.async.enable=true
   hoodie.parquet.small.file.limit=1048576
   hoodie.compaction.target.io=50
   hoodie.metadata.metrics.enable=true
   
   hoodie.metadata.index.bloom.filter.enable=false
   hoodie.metadata.index.column.stats.enable=false 
   hoodie.write.concurrency.mode=optimistic_concurrency_control
   hoodie.cleaner.policy.failed.writes=LAZY
   
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.InProcessLockProvider
   hoodie.write.lock.wait_time_ms=30
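   
   A minimal sketch, assuming the options are assembled programmatically: the property keys below come straight from this report, while the class name and helper method are purely illustrative, not part of any Hudi API.
   
   ```java
   import org.apache.hudi.common.config.TypedProperties;
   
   // Hypothetical helper that bundles the cleaner/compaction options above
   // into TypedProperties for a Deltastreamer-style writer.
   public class CleanerCompactionProps {
     public static TypedProperties build() {
       TypedProperties props = new TypedProperties();
       props.setProperty("hoodie.cleaner.policy", "KEEP_LATEST_COMMITS");
       props.setProperty("hoodie.clean.automatic", "true");
       props.setProperty("hoodie.clean.async", "true");
       props.setProperty("hoodie.cleaner.commits.retained", "5");
       props.setProperty("hoodie.keep.min.commits", "10");
       props.setProperty("hoodie.datasource.compaction.async.enable", "true");
       props.setProperty("hoodie.write.concurrency.mode", "optimistic_concurrency_control");
       props.setProperty("hoodie.cleaner.policy.failed.writes", "LAZY");
       props.setProperty("hoodie.write.lock.provider",
           "org.apache.hudi.client.transaction.lock.InProcessLockProvider");
       return props;
     }
   }
   ```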
   
   **Expected behavior**
   
   Delta commit and clean actions should not take this long.
   
   
   **Environment Description**
   
   * Hudi version : 0.12.1
   
   * Spark version : 3.2.1
   
   * Hive version : 2.3.5
   
   * Hadoop version : 2.7.7
   
   * Storage (HDFS/S3/GCS..) : GCS
   
   * Running on Docker? (yes/no) : yes
   
   
   **Additional context**
   
   
   Delta streamer runs in continuous mode. 
   
   Adding screenshots of the timeline of operations:
   
   https://user-images.githubusercontent.com/25053668/210357883-9b72667c-1afe-4d0f-ab77-14c9a8ee0e32.png
   
   https://user-images.githubusercontent.com/25053668/210358097-9806715e-0e5f-44cf-9976-f478841a1433.png
   
   Below is the only error I see in the logs.
   
   **Stacktrace**
   
   ```
   RequestHandler: Bad request response due to client view behind server 
view. Last known instant from client was 20230103113021745 but server has the 
following timeline [[20221128033016359__rollback__COMPLETED], 
[20221128042615784__rollback__COMPLETED], 
[20221128052249948__rollback__COMPLETED], 
[20221128100542977__rollback__COMPLETED], 
[20221128114411534__rollback__COMPLETED], 
[20221128121237952__rollback__COMPLETED], 
[20221128121547373__rollback__COMPLETED], 
[20221128124007294__rollback__COMPLETED], 
[20221128130510784__rollback__COMPLETED], 
[20221128150135765__rollback__COMPLETED], 
[20221202082857955__rollback__COMPLETED], 
[20221202083358380__rollback__COMPLETED], 
[20221205180609234__rollback__COMPLETED], 
[20221213024840399__rollback__COMPLETED], 
[20221215121336002__rollback__COMPLETED], 
[20230103075416732__clean__COMPLETED], [20230103080003681__clean__COMPLETED], 
[20230103080537813__clean__COMPLETED], [20230103081110194__clean__COMPLETED], 
[20230103081642791__clean__COMPLETED]
 , [20230103082158513__clean__COMPLETED], 
[20230103082749103__clean__COMPLETED], [20230103083327661__clean__COMPLETED], 
[20230103083915577__clean__COMPLETED], [20230103084450294__clean__COMPLETED], 
[20230103085022170__clean__COMPLETED], 
[20230103085539296__deltacommit__COMPLETED], 
[20230103085550414__clean__COMPLETED], 
[20230103090129353__deltacommit__COMPLETED], 
[20230103090140117__clean__COMPLETED], 
[20230103090705599__deltacommit__COMPLETED], 
[20230103090716308__clean__COMPLETED], 
[20230103091245975__deltacommit__COMPLETED], 
[20230103091256846__clean__COMPLETED], 
[20230103091825253__deltacommit__COMPLETED], 
[20230103091836101__clean__COMPLETED], 
[20230103092403683__deltacommit__COMPLETED], 
[20230103092414824__clean__COMPLETED], [20230103092828723__commit__COMPLETED], 
[20230103092851264__clean__COMPLETED], 
[20230103092923310__deltacommit__COMPLETED], 
[20230103093158260__clean__COMPLETED], 
[20230103102048896__deltacommit__COMPLETED], 
[20230103102100480__clean__COMPLETED], [202301031
 02637434__deltacommit__COMPLETED], [20230103102648856__clean__COMPLETED], 
[20230103103218354__deltacommit__COMPLETED], 
[20230103103229738__clean__COMPLETED], 
[20230103103812033__deltacommit__COMPLETED], 
[20230103103823381__clean__COMPLETED], 
[20230103104351306__deltacommit__COMPLETED], 
[20230103104402684__clean__COMPLETED], 
[20
   ```

[GitHub] [hudi] minihippo commented on a diff in pull request #5064: [HUDI-3654] Add new module `hudi-metaserver`

2023-01-03 Thread GitBox


minihippo commented on code in PR #5064:
URL: https://github.com/apache/hudi/pull/5064#discussion_r1060528770


##
hudi-platform-service/hudi-metaserver/src/main/java/org/apache/hudi/metaserver/client/HoodieMetaserverClientImp.java:
##
@@ -108,35 +108,42 @@ public void createTable(Table table) {
     }
   }
 
+  @Override
   public List<HoodieInstant> listInstants(String db, String tb, int commitNum) {
     return exceptionWrapper(() -> this.client.listInstants(db, tb, commitNum).stream()
-        .map(EntityConvertor::fromTHoodieInstant)
+        .map(EntityConversions::fromTHoodieInstant)
         .collect(Collectors.toList())).get();
   }
 
+  @Override
   public Option<byte[]> getInstantMeta(String db, String tb, HoodieInstant instant) {
-    ByteBuffer bytes = exceptionWrapper(() -> this.client.getInstantMeta(db, tb, EntityConvertor.toTHoodieInstant(instant))).get();
-    Option<byte[]> res = bytes.capacity() == 0 ? Option.empty() : Option.of(bytes.array());
-    return res;
+    ByteBuffer byteBuffer = exceptionWrapper(() -> this.client.getInstantMeta(db, tb, EntityConversions.toTHoodieInstant(instant))).get();
+    byte[] bytes = new byte[byteBuffer.remaining()];
+    byteBuffer.get(bytes);
+    return byteBuffer.hasRemaining() ? Option.empty() : Option.of(bytes);

Review Comment:
   There is the basic API `HoodieActiveTimeline#transitionRequestedToInflight`, which allows committing an empty instant meta:
   
![image](https://user-images.githubusercontent.com/17903481/210355049-cc304ae1-495c-4c8a-b9ef-6228204aa0a6.png)
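   
   For context, a minimal sketch (independent of the PR) of draining a Thrift `ByteBuffer` payload into `Option<byte[]>`; note the emptiness check has to happen before `get()`, since `hasRemaining()` is always false once the buffer is drained. `Option` here is `org.apache.hudi.common.util.Option`; the class and method names are illustrative.
   
   ```java
   import java.nio.ByteBuffer;
   
   import org.apache.hudi.common.util.Option;
   
   public class InstantMetaBytes {
     // Treat an empty payload as "no instant meta". Copying via get()
     // instead of array() also handles read-only or offset buffers.
     public static Option<byte[]> toOption(ByteBuffer buffer) {
       if (buffer == null || !buffer.hasRemaining()) {
         return Option.empty();
       }
       byte[] bytes = new byte[buffer.remaining()];
       buffer.get(bytes);
       return Option.of(bytes);
     }
   }
   ```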
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] minihippo commented on a diff in pull request #5064: [HUDI-3654] Add new module `hudi-metaserver`

2023-01-03 Thread GitBox


minihippo commented on code in PR #5064:
URL: https://github.com/apache/hudi/pull/5064#discussion_r1060528770


##
hudi-platform-service/hudi-metaserver/src/main/java/org/apache/hudi/metaserver/client/HoodieMetaserverClientImp.java:
##
@@ -108,35 +108,42 @@ public void createTable(Table table) {
     }
   }
 
+  @Override
   public List<HoodieInstant> listInstants(String db, String tb, int commitNum) {
     return exceptionWrapper(() -> this.client.listInstants(db, tb, commitNum).stream()
-        .map(EntityConvertor::fromTHoodieInstant)
+        .map(EntityConversions::fromTHoodieInstant)
         .collect(Collectors.toList())).get();
   }
 
+  @Override
   public Option<byte[]> getInstantMeta(String db, String tb, HoodieInstant instant) {
-    ByteBuffer bytes = exceptionWrapper(() -> this.client.getInstantMeta(db, tb, EntityConvertor.toTHoodieInstant(instant))).get();
-    Option<byte[]> res = bytes.capacity() == 0 ? Option.empty() : Option.of(bytes.array());
-    return res;
+    ByteBuffer byteBuffer = exceptionWrapper(() -> this.client.getInstantMeta(db, tb, EntityConversions.toTHoodieInstant(instant))).get();
+    byte[] bytes = new byte[byteBuffer.remaining()];
+    byteBuffer.get(bytes);
+    return byteBuffer.hasRemaining() ? Option.empty() : Option.of(bytes);

Review Comment:
   There is the basic API which allows committing an empty instant meta:
   
![image](https://user-images.githubusercontent.com/17903481/210355049-cc304ae1-495c-4c8a-b9ef-6228204aa0a6.png)
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] soumilshah1995 commented on issue #7591: [SUPPORT] Kinesis Data Analytics Flink1.13 to HUDI

2023-01-03 Thread GitBox


soumilshah1995 commented on issue #7591:
URL: https://github.com/apache/hudi/issues/7591#issuecomment-1369696915

   Hey Danny,
   Yes, I did try that yesterday; I could not get it to work.
   I keep getting the same error message.
   
   
   
   
   On Tue, Jan 3, 2023 at 12:08 AM Danny Chan ***@***.***> wrote:
   
   > I see you use the streaming execution mode for VALUES SQL statement, did
   > you try the batch execution mode instead then ?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on pull request #6524: [HUDI-4717] CompactionCommitEvent message corrupted when sent by compact_task

2023-01-03 Thread GitBox


danny0405 commented on PR #6524:
URL: https://github.com/apache/hudi/pull/6524#issuecomment-1369691323

   > > Does #7399 solve your problem here?
   > 
   > Yeah, we cherry-picked [#7399](https://github.com/apache/hudi/pull/7399). But if the user enables the latency marker, there is still a thread-safety problem.
   
   Does the Operator have any interface to handle the latency marker yet? Just like what we do for watermarks.
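   
   For reference, a hedged sketch of intercepting latency markers in a Flink operator, analogous to watermarks; it assumes the `processLatencyMarker` hook exposed by `AbstractStreamOperator` in the Flink 1.x streaming API, so treat the exact signatures as unverified.
   
   ```java
   import org.apache.flink.streaming.api.operators.AbstractStreamOperator;
   import org.apache.flink.streaming.api.operators.OneInputStreamOperator;
   import org.apache.flink.streaming.runtime.streamrecord.LatencyMarker;
   import org.apache.flink.streaming.runtime.streamrecord.StreamRecord;
   
   public class LatencyMarkerAwareOperator extends AbstractStreamOperator<String>
       implements OneInputStreamOperator<String, String> {
   
     @Override
     public void processElement(StreamRecord<String> element) throws Exception {
       output.collect(element); // pass data records through unchanged
     }
   
     @Override
     public void processLatencyMarker(LatencyMarker marker) throws Exception {
       // Handle the marker on the operator thread before forwarding it,
       // rather than letting it race with the processing logic.
       super.processLatencyMarker(marker);
     }
   }
   ```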


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] minihippo commented on pull request #7572: [HUDI-5483]Make retryhelper more suitable for common use.

2023-01-03 Thread GitBox


minihippo commented on PR #7572:
URL: https://github.com/apache/hudi/pull/7572#issuecomment-1369688545

   @hudi-bot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7573: [HUDI-5484] Avoid using GenericRecord in ColumnStatMetadata

2023-01-03 Thread GitBox


hudi-bot commented on PR #7573:
URL: https://github.com/apache/hudi/pull/7573#issuecomment-1369686205

   
   ## CI report:
   
   * 92b8c60d309978d24aa33badba2cd4d9f0640b18 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14078)
 
   * 1ac267ba9af690ecd47f74f60c34851387aee9eb Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14080)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7573: [HUDI-5484] Avoid using GenericRecord in ColumnStatMetadata

2023-01-03 Thread GitBox


hudi-bot commented on PR #7573:
URL: https://github.com/apache/hudi/pull/7573#issuecomment-1369681168

   
   ## CI report:
   
   * 70796357cee7f956d7ad595f27fa2a8e8524d798 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14030)
 
   * 92b8c60d309978d24aa33badba2cd4d9f0640b18 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14078)
 
   * 1ac267ba9af690ecd47f74f60c34851387aee9eb UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7593: [HUDI-5492] spark call command show_compaction doesn't return the com…

2023-01-03 Thread GitBox


hudi-bot commented on PR #7593:
URL: https://github.com/apache/hudi/pull/7593#issuecomment-1369676071

   
   ## CI report:
   
   * e7dad5ff4526bd3ce1f93c0a6143f919eeb57bb4 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14077)
 
   * 8dac276274844f65a48d2e877a3cb1ed1d4ec3e3 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14079)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-5490) Investigate test failures w/ record level index for existing tests

2023-01-03 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain updated HUDI-5490:
--
Description: 
Enable record level index for some of the chosen tests (30 to 40) and ensure they succeed. The parameterized tests covered in this Jira are:

TestCOWDataSourceStorage, TestSparkDataSource, TestMORDataSourceStorage, 
TestCOWDataSource#testDropInsertDup and 
TestHoodieClientOnCopyOnWriteStorage (Sub tests below for 
TestHoodieClientOnCopyOnWriteStorage)
Auto commit tests
testDeduplicationOnInsert
testDeduplicationOnUpsert
testInsertsWithHoodieConcatHandle
testDeletes
testUpsertsUpdatePartitionPathGlobalBloom
testSmallInsertHandlingForUpserts
testSmallInsertHandlingForInserts
testDeletesWithDeleteApi

  was:
Enable record level index for some of the chosen tests (30 to 40) and ensure 
they succeed. 

 


> Investigate test failures w/ record level index for existing tests
> --
>
> Key: HUDI-5490
> URL: https://issues.apache.org/jira/browse/HUDI-5490
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: Lokesh Jain
>Priority: Critical
> Fix For: 0.13.0
>
>
> Enable record level index for some of the chosen tests (30 to 40) and ensure 
> they succeed. The parameterized tests covered in this Jira are:
> TestCOWDataSourceStorage, TestSparkDataSource, TestMORDataSourceStorage, 
> TestCOWDataSource#testDropInsertDup and 
> TestHoodieClientOnCopyOnWriteStorage (Sub tests below for 
> TestHoodieClientOnCopyOnWriteStorage)
> Auto commit tests
> testDeduplicationOnInsert
> testDeduplicationOnUpsert
> testInsertsWithHoodieConcatHandle
> testDeletes
> testUpsertsUpdatePartitionPathGlobalBloom
> testSmallInsertHandlingForUpserts
> testSmallInsertHandlingForInserts
> testDeletesWithDeleteApi



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] IvyIvy1109 commented on issue #7594: [SUPPORT] Hudi Time Travel from Athena

2023-01-03 Thread GitBox


IvyIvy1109 commented on issue #7594:
URL: https://github.com/apache/hudi/issues/7594#issuecomment-1369635307

   > Hi,
   > 
   > Simple question - Does AWS Athena support Hudi Time Travel?
   > 
   > We see good support for Iceberg tables [here](https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg-table-data.html). I see some mention in this [ticket](https://github.com/apache/hudi/issues/4502), but can you share more details about it?
   
   +1


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7593: [HUDI-5492] spark call command show_compaction doesn't return the com…

2023-01-03 Thread GitBox


hudi-bot commented on PR #7593:
URL: https://github.com/apache/hudi/pull/7593#issuecomment-1369614570

   
   ## CI report:
   
   * e7dad5ff4526bd3ce1f93c0a6143f919eeb57bb4 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14077)
 
   * 8dac276274844f65a48d2e877a3cb1ed1d4ec3e3 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] maddy2u opened a new issue, #7594: [SUPPORT] Hudi Time Travel from Athena

2023-01-03 Thread GitBox


maddy2u opened a new issue, #7594:
URL: https://github.com/apache/hudi/issues/7594

   Hi,
   
   Simple question - Does AWS Athena support Hudi Time Travel?
   
   We see good support for Iceberg tables [here](https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg-table-data.html). I see some mention in this [ticket](https://github.com/apache/hudi/issues/4502), but can you share more details about it?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7355: [HUDI-5308] Hive query returns null when the where clause has a partition field

2023-01-03 Thread GitBox


hudi-bot commented on PR #7355:
URL: https://github.com/apache/hudi/pull/7355#issuecomment-1369595486

   
   ## CI report:
   
   * efcb91b1f4a577a016a82bd4a6e0a203d04c251f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14072)
 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14074)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yabha-isomap commented on issue #7381: [SUPPORT] Dependencies required for running flink hudi quickstart.

2023-01-03 Thread GitBox


yabha-isomap commented on issue #7381:
URL: https://github.com/apache/hudi/issues/7381#issuecomment-1369565883

   Thanks @danny0405. When I use the following dependency
   
   ```xml
   <dependency>
       <groupId>org.apache.hadoop</groupId>
       <artifactId>hadoop-common</artifactId>
       <version>3.3.4</version>
       <scope>provided</scope>
   </dependency>
   ```
   
   Then it doesn't give the error, but it doesn't work as expected, i.e. no records get written to the Hudi table.
   
   But if I use the following dependency instead, then it works as expected.
   
   ```xml
   <dependency>
       <groupId>org.apache.hudi</groupId>
       <artifactId>hudi-hadoop-mr-bundle</artifactId>
       <version>0.12.1</version>
   </dependency>
   ```
   
   Any thoughts?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yyar commented on issue #7472: [SUPPORT] Too many metadata timeline file caused by old rollback active timeline

2023-01-03 Thread GitBox


yyar commented on issue #7472:
URL: https://github.com/apache/hudi/issues/7472#issuecomment-1369538405

   @yihua Okay. Since I'm using version 0.11.1, I'll cherry-pick [these two commits](https://github.com/apache/hudi/pull/7580/commits) onto the 0.11.1 release.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7573: [HUDI-5484] Avoid using GenericRecord in ColumnStatMetadata

2023-01-03 Thread GitBox


hudi-bot commented on PR #7573:
URL: https://github.com/apache/hudi/pull/7573#issuecomment-1369524807

   
   ## CI report:
   
   * 70796357cee7f956d7ad595f27fa2a8e8524d798 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14030)
 
   * 92b8c60d309978d24aa33badba2cd4d9f0640b18 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14078)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


