[GitHub] [hudi] hudi-bot commented on pull request #6476: [HUDI-3478] Support CDC for Spark in Hudi

2022-08-22 Thread GitBox


hudi-bot commented on PR #6476:
URL: https://github.com/apache/hudi/pull/6476#issuecomment-1223625596

   
   ## CI report:
   
   * 1fd639d2941b41ea33f076cc249539d34514046d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10893)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6450: [HUDI-4665] Flipping default for "ignore failed batch" config in streaming sink to false

2022-08-22 Thread GitBox


hudi-bot commented on PR #6450:
URL: https://github.com/apache/hudi/pull/6450#issuecomment-1223625512

   
   ## CI report:
   
   * 3bd700dea82006f1d3081c3eee7ab1b430728911 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10888)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Closed] (HUDI-4567) Finalize design approach and RFC docs

2022-08-22 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu closed HUDI-4567.

Resolution: Done

> Finalize design approach and RFC docs
> -
>
> Key: HUDI-4567
> URL: https://issues.apache.org/jira/browse/HUDI-4567
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Critical
> Fix For: 0.13.0
>
>






[jira] [Commented] (HUDI-4567) Finalize design approach and RFC docs

2022-08-22 Thread Raymond Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-4567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583401#comment-17583401
 ] 

Raymond Xu commented on HUDI-4567:
--

Done https://github.com/apache/hudi/pull/6256

> Finalize design approach and RFC docs
> -
>
> Key: HUDI-4567
> URL: https://issues.apache.org/jira/browse/HUDI-4567
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Critical
> Fix For: 0.13.0
>
>






[GitHub] [hudi] hudi-bot commented on pull request #6476: [HUDI-3478] Support CDC for Spark in Hudi

2022-08-22 Thread GitBox


hudi-bot commented on PR #6476:
URL: https://github.com/apache/hudi/pull/6476#issuecomment-1223621185

   
   ## CI report:
   
   * 1fd639d2941b41ea33f076cc249539d34514046d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6450: [HUDI-4665] Flipping default for "ignore failed batch" config in streaming sink to false

2022-08-22 Thread GitBox


hudi-bot commented on PR #6450:
URL: https://github.com/apache/hudi/pull/6450#issuecomment-1223621065

   
   ## CI report:
   
   * 3bd700dea82006f1d3081c3eee7ab1b430728911 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10888)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6000: [HUDI-4340] fix not parsable text DateTimeParseException in HoodieInstantTimeGenerator.parseDateFromInstantTime

2022-08-22 Thread GitBox


hudi-bot commented on PR #6000:
URL: https://github.com/apache/hudi/pull/6000#issuecomment-1223620416

   
   ## CI report:
   
   * b54e1a1397b1294cc4dc6e28bdfea7fb4ccaceab Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10892)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] nsivabalan commented on pull request #6450: [HUDI-4665] Flipping default for "ignore failed batch" config in streaming sink to false

2022-08-22 Thread GitBox


nsivabalan commented on PR #6450:
URL: https://github.com/apache/hudi/pull/6450#issuecomment-1223619767

   @hudi-bot run azure





[GitHub] [hudi] xushiyan closed pull request #2426: [HUDI-304] Configure spotless and java style

2022-08-22 Thread GitBox


xushiyan closed pull request #2426: [HUDI-304] Configure spotless and java style
URL: https://github.com/apache/hudi/pull/2426





[jira] [Updated] (HUDI-2768) Enable async timeline server by default

2022-08-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-2768:
-
Labels: hudi-on-call pull-request-available  (was: hudi-on-call)

> Enable async timeline server by default
> ---
>
> Key: HUDI-2768
> URL: https://issues.apache.org/jira/browse/HUDI-2768
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: timeline-server, writer-core
>Reporter: sivabalan narayanan
>Assignee: Ethan Guo
>Priority: Critical
>  Labels: hudi-on-call, pull-request-available
> Fix For: 0.12.1
>
>
> Enable async timeline server by default.
>  
> [https://github.com/apache/hudi/pull/3949]
>  
>  





[GitHub] [hudi] xushiyan closed pull request #4807: [HUDI-2768] Make async timeline server by default

2022-08-22 Thread GitBox


xushiyan closed pull request #4807: [HUDI-2768] Make async timeline server by 
default
URL: https://github.com/apache/hudi/pull/4807





[GitHub] [hudi] hudi-bot commented on pull request #6000: [HUDI-4340] fix not parsable text DateTimeParseException in HoodieInstantTimeGenerator.parseDateFromInstantTime

2022-08-22 Thread GitBox


hudi-bot commented on PR #6000:
URL: https://github.com/apache/hudi/pull/6000#issuecomment-1223616053

   
   ## CI report:
   
   * b54e1a1397b1294cc4dc6e28bdfea7fb4ccaceab Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10892)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] xushiyan closed pull request #6040: [HUDI-4322] Deprecate partition value extractor

2022-08-22 Thread GitBox


xushiyan closed pull request #6040: [HUDI-4322] Deprecate partition value 
extractor
URL: https://github.com/apache/hudi/pull/6040





[jira] [Updated] (HUDI-4619) The retry mechanism of remotehoodietablefilesystemview needs to be thread safe

2022-08-22 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4619:
--
Reviewers: sivabalan narayanan  (was: Raymond Xu)

> The retry mechanism of remotehoodietablefilesystemview needs to be thread safe
> --
>
> Key: HUDI-4619
> URL: https://issues.apache.org/jira/browse/HUDI-4619
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: core
>Reporter: HunterHunter
>Assignee: HunterHunter
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> {code:java}
> Caused by: java.io.FileNotFoundException: File 
> file:/hudi_tbl/2/.-575e-4905-85e4-fb62d29a4bea_20220813205548259.log.1_0-1-0
>  does not exist. {code}
> This error is because the partition of the return value of 
> {{fileSystemView.getLatestFileSlices(partitionPath)}} is not equal to 
> {{partitionPath}}.
> {code:java}
> return like this : 
> request partitionPath : 6, response slice part : 1, 
> request partitionPath : 1, response slice part : 1,
> request partitionPath : 9, response slice part : 0, 
> request partitionPath : 3, response slice part : 4, 
> request partitionPath : 4, response slice part : 0,
> request partitionPath : 5, response slice part : 0, 
> request partitionPath : 0, response slice part : 0, 
> request partitionPath : 8, response slice part : 8, 
> request partitionPath : 2, response slice part : 4, 
> request partitionPath : 7, response slice part : 5,  {code}
>  
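The interleaved responses above are what shared mutable retry state looks like under concurrency. A minimal sketch of the thread-safe alternative, keeping all retry state on the calling thread's stack (illustrative only; RetrySupport and its signature are hypothetical, not the actual Hudi fix):

{code:java}
import java.util.concurrent.Callable;

public final class RetrySupport {
  // All retry state (attempt counter, last error) is local to this call,
  // so concurrent callers can never observe each other's partial results.
  public static <T> T executeWithRetry(Callable<T> action, int maxRetries) throws Exception {
    Exception last = null;
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        return action.call();
      } catch (Exception e) {
        last = e; // remember and retry
      }
    }
    throw last;
  }
}
{code}

A caller would wrap each remote view request, e.g. executeWithRetry(() -> fileSystemView.getLatestFileSlices(partitionPath), 3), so a request for partition 6 can never come back with slices for partition 1.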





[jira] [Updated] (HUDI-4326) Hudi spark datasource error after migrate from 0.8 to 0.11

2022-08-22 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4326:
--
Reviewers: sivabalan narayanan  (was: Raymond Xu)

> Hudi spark datasource error after migrate from 0.8 to 0.11
> --
>
> Key: HUDI-4326
> URL: https://issues.apache.org/jira/browse/HUDI-4326
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Reporter: Kyle Zhike Chen
>Assignee: Kyle Zhike Chen
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> After upgrading Hudi from 0.8 to 0.11, using {{spark.table(fullTableName)}} to 
> read a Hudi table no longer works; the table has been synced to the Hive 
> metastore and Spark is connected to the metastore. The error is
> org.sparkproject.guava.util.concurrent.UncheckedExecutionException: 
> org.apache.hudi.exception.HoodieException: 'path' or 'Key: 
> 'hoodie.datasource.read.paths' , default: null description: Comma separated 
> list of file paths to read within a Hudi table. since version: version is not 
> defined deprecated after: version is not defined)' or both must be specified.
> at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2263)
> at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4000)
> at 
> org.sparkproject.guava.cache.LocalCache$LocalManualCache.get(LocalCache.java:4789)
> at org.apache.spark.sql.catalyst.catalog.SessionCatalog.
> ...
> Caused by: org.apache.hudi.exception.HoodieException: 'path' or 'Key: 
> 'hoodie.datasource.read.paths' , default: null description: Comma separated 
> list of file paths to read within a Hudi table. since version: version is not 
> defined deprecated after: version is not defined)' or both must be specified.
> at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:78)
> at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:353)
> at 
> org.apache.spark.sql.execution.datasources.FindDataSourceTable.$anonfun$readDataSourceTable$1(DataSourceStrategy.scala:261)
> at 
> org.sparkproject.guava.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4792)
> at 
> org.sparkproject.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
> at 
> org.sparkproject.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
> at 
> org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
> at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) 
> After changing the table to the spark data source table, the table SerDeInfo 
> is missing. I created a pull request.
>  
> related GH issue:
> https://github.com/apache/hudi/issues/5861
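A hedged workaround sketch while the metastore entry lacks the data source properties (this is not the merged fix, and the table path is hypothetical): load the table by its base path so DefaultSource.createRelation receives the 'path' it expects.

{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public final class ReadByPathWorkaround {
  // Reading by base path avoids resolving the relation through the Hive
  // metastore entry whose SerDeInfo/provider properties were lost.
  static Dataset<Row> read(SparkSession spark) {
    return spark.read().format("hudi").load("/data/warehouse/my_hudi_table");
  }
}
{code}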





[jira] [Assigned] (HUDI-4066) HiveMetastoreBasedLockProvider can not release lock when writer fails

2022-08-22 Thread Jian Feng (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Feng reassigned HUDI-4066:
---

Assignee: Jian Feng

> HiveMetastoreBasedLockProvider can not release lock when writer fails
> -
>
> Key: HUDI-4066
> URL: https://issues.apache.org/jira/browse/HUDI-4066
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: multi-writer
>Affects Versions: 0.10.1
>Reporter: Jian Feng
>Assignee: Jian Feng
>Priority: Critical
> Fix For: 1.0.0
>
>
> We use HiveMetastoreBasedLockProvider in the prod environment: one writer 
> ingests data with Flink, and another writer deletes some old partitions 
> with Spark. Sometimes the Spark job fails, but the lock is not released, and 
> then all writers fail.  
> {code:java}
> // error log
> 22/04/01 08:12:18 INFO TransactionManager: Transaction starting without a transaction owner
> 22/04/01 08:12:18 INFO LockManager: LockProvider org.apache.hudi.hive.HiveMetastoreBasedLockProvider
> 22/04/01 08:12:19 INFO metastore: Trying to connect to metastore with URI thrift://10.128.152.245:9083
> 22/04/01 08:12:19 INFO metastore: Opened a connection to metastore, current connections: 1
> 22/04/01 08:12:19 INFO metastore: Connected to metastore.
> 22/04/01 08:12:20 INFO HiveMetastoreBasedLockProvider: ACQUIRING lock at database dev_video and table dwd_traffic_log
> 22/04/01 08:12:25 INFO TransactionManager: Transaction ending without a transaction owner
> 22/04/01 08:12:25 INFO HiveMetastoreBasedLockProvider: RELEASING lock at database dev_video and table dwd_traffic_log
> 22/04/01 08:12:25 INFO TransactionManager: Transaction ended without a transaction owner
> Exception in thread "main" org.apache.hudi.exception.HoodieLockException: Unable to acquire lock, lock object 
>   at org.apache.hudi.client.transaction.lock.LockManager.lock(LockManager.java:71)
>   at org.apache.hudi.client.transaction.TransactionManager.beginTransaction(TransactionManager.java:51)
>   at org.apache.hudi.client.SparkRDDWriteClient.getTableAndInitCtx(SparkRDDWriteClient.java:430)
>   at org.apache.hudi.client.SparkRDDWriteClient.deletePartitions(SparkRDDWriteClient.java:261)
>   at org.apache.hudi.DataSourceUtils.doDeletePartitionsOperation(DataSourceUtils.java:234)
>   at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:217)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:164)
>   at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
>   at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>   at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>   at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:90)
>   at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
>   at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
>   at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:132)
>   at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:131)
>   at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:991)
>   at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>   at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:991)
>   at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:438)
>   at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:415)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:293)
>   at com.shopee.ci.hudi.tasks.ExpiredPartitionDelete$.$anonfun$main$2(ExpiredPartitionDelete.scala:82)
>   at com.shopee.ci.hudi.tasks.ExpiredPartitionDelete$.$anonfun$main$2$adapted(ExpiredPartitionDelete.scala:65)
>   at scala.collection.Iterator.foreach(Iter
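The pattern the stack trace points at is a lock acquired without a guaranteed release. A minimal sketch of the failure-safe shape (java.util.concurrent stands in for the Hive-metastore-backed lock; names are illustrative, not the actual fix):

{code:java}
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public final class LockReleaseSketch {
  // Stand-in for the metastore-backed table lock.
  private static final Lock TABLE_LOCK = new ReentrantLock();

  static void deleteExpiredPartitions(Runnable sparkDeleteJob) {
    TABLE_LOCK.lock();
    try {
      sparkDeleteJob.run(); // may throw, e.g. when the Spark job fails
    } finally {
      TABLE_LOCK.unlock(); // released on failure too, so other writers proceed
    }
  }
}
{code}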

[jira] [Updated] (HUDI-4582) Sync 11w partitions to hive by using HiveSyncTool with(--sync-mode="hms" and use-jdbc=false) with timeout

2022-08-22 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4582:
--
Reviewers: sivabalan narayanan  (was: Raymond Xu)

> Sync 11w partitions to hive by using HiveSyncTool with(--sync-mode="hms" and 
> use-jdbc=false) with timeout
> -
>
> Key: HUDI-4582
> URL: https://issues.apache.org/jira/browse/HUDI-4582
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: meta-sync
>Reporter: XixiHua
>Assignee: XixiHua
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> When we try to sync 110k (11w) partitions to Hive using 
> HiveSyncTool (--sync-mode="hms" and use-jdbc=false), it fails with a timeout. 
>  
> https://issues.apache.org/jira/browse/HUDI-2116 only solved this for 
> --sync-mode=jdbc with the parameter HIVE_BATCH_SYNC_PARTITION_NUM, and I 
> want to extend it to hms mode. 
>  
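A hedged sketch of what the proposed extension would look like from the writer side, assuming hms mode honored the jdbc-mode batching key from HUDI-2116 (hoodie.datasource.hive_sync.batch_num); the batch size here is hypothetical:

{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public final class HmsBatchSyncSketch {
  static void write(Dataset<Row> df, String basePath) {
    df.write().format("hudi")
        .option("hoodie.datasource.hive_sync.enable", "true")
        .option("hoodie.datasource.hive_sync.mode", "hms")       // sync via metastore client
        .option("hoodie.datasource.hive_sync.use_jdbc", "false")
        // batching knob from HUDI-2116; honoring it in hms mode is the ask here
        .option("hoodie.datasource.hive_sync.batch_num", "3000")
        .mode(SaveMode.Append)
        .save(basePath);
  }
}
{code}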





[jira] [Updated] (HUDI-4340) DeltaStreamer bootstrap failed when metrics on caused by DateTimeParseException: Text '00000000000001999' could not be parsed

2022-08-22 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4340:
--
Reviewers: sivabalan narayanan  (was: Raymond Xu)

> DeltaStreamer bootstrap failed when metrics on caused by 
> DateTimeParseException: Text '00000000000001999' could not be parsed
> -
>
> Key: HUDI-4340
> URL: https://issues.apache.org/jira/browse/HUDI-4340
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: deltastreamer, metrics
>Reporter: Teng Huo
>Assignee: Teng Huo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.12.1
>
> Attachments: error-deltastreamer.log
>
>
> Found this bug in the Hudi integration test ITTestHoodieDemo.java.
> HoodieTimeline.METADATA_BOOTSTRAP_INSTANT_TS is an invalid value, 
> "00000000000001", which cannot be parsed by DateTimeFormatter with format 
> SECS_INSTANT_TIMESTAMP_FORMAT = "yyyyMMddHHmmss" in method 
> HoodieInstantTimeGenerator.parseDateFromInstantTime.
> Error code at 
> org.apache.hudi.common.table.timeline.HoodieInstantTimeGenerator.parseDateFromInstantTime(HoodieInstantTimeGenerator.java:96)
> https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieInstantTimeGenerator.java#L100
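A standalone demo of why the reserved instant cannot be parsed (a sketch of the failure mode, not of the fix):

{code:java}
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public final class InstantParseDemo {
  // Same pattern as SECS_INSTANT_TIMESTAMP_FORMAT.
  private static final DateTimeFormatter FORMATTER = DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

  public static void main(String[] args) {
    // A real commit instant parses fine.
    System.out.println(LocalDateTime.parse("20220822081218", FORMATTER));
    try {
      // The reserved bootstrap instant is not a calendar date (month 00),
      // so parsing throws DateTimeParseException.
      LocalDateTime.parse("00000000000001", FORMATTER);
    } catch (DateTimeParseException e) {
      System.out.println("expected failure: " + e.getMessage());
    }
  }
}
{code}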





[jira] [Updated] (HUDI-4431) Fix log file will not roll over to a new file

2022-08-22 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4431:
--
Reviewers: sivabalan narayanan  (was: Raymond Xu)

> Fix log file will not roll over to a new file
> -
>
> Key: HUDI-4431
> URL: https://issues.apache.org/jira/browse/HUDI-4431
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Forward Xu
>Assignee: Forward Xu
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>






[GitHub] [hudi] TengHuo commented on pull request #6000: [HUDI-4340] fix not parsable text DateTimeParseException in HoodieInstantTimeGenerator.parseDateFromInstantTime

2022-08-22 Thread GitBox


TengHuo commented on PR #6000:
URL: https://github.com/apache/hudi/pull/6000#issuecomment-1223599455

   @hudi-bot run azure





[GitHub] [hudi] TengHuo commented on pull request #6000: [HUDI-4340] fix not parsable text DateTimeParseException in HoodieInstantTimeGenerator.parseDateFromInstantTime

2022-08-22 Thread GitBox


TengHuo commented on PR #6000:
URL: https://github.com/apache/hudi/pull/6000#issuecomment-1223597840

   Hi @danny0405 
   
   Just pushed a new commit and updated the PR description. Please help review, 
thanks a lot.





[GitHub] [hudi] bhasudha opened a new pull request, #6477: [DOCS] Change community sync schedule image

2022-08-22 Thread GitBox


bhasudha opened a new pull request, #6477:
URL: https://github.com/apache/hudi/pull/6477

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   **Risk level: none | low | medium | high**
   
   _Choose one. If medium or high, explain what verification was done to 
mitigate the risks._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[GitHub] [hudi] hudi-bot commented on pull request #6467: [HUDI-4686] Flip option 'write.ignore.failed' to default false

2022-08-22 Thread GitBox


hudi-bot commented on PR #6467:
URL: https://github.com/apache/hudi/pull/6467#issuecomment-1223562209

   
   ## CI report:
   
   * 23b77552e300ca697b142ebe687cf2a8b4452bfa Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10891)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6000: [HUDI-4340] fix not parsable text DateTimeParseException in HoodieInstantTimeGenerator.parseDateFromInstantTime

2022-08-22 Thread GitBox


hudi-bot commented on PR #6000:
URL: https://github.com/apache/hudi/pull/6000#issuecomment-1223561595

   
   ## CI report:
   
   * b54e1a1397b1294cc4dc6e28bdfea7fb4ccaceab Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10892)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] xushiyan commented on pull request #6256: [RFC-51][HUDI-3478] Update RFC: CDC support

2022-08-22 Thread GitBox


xushiyan commented on PR #6256:
URL: https://github.com/apache/hudi/pull/6256#issuecomment-1223535084

   > @xushiyan - Do you want to take a final stab at the RFC incorporating all 
the changes? I believe we have consensus on what needs to be done. Please 
correct me if I am wrong. cc @YannByron @danny0405
   
   @prasannarajaperumal I've updated the RFC as per the recent discussions.





[GitHub] [hudi] YannByron commented on pull request #5885: [HUDI-3478] Support CDC for Spark in Hudi

2022-08-22 Thread GitBox


YannByron commented on PR #5885:
URL: https://github.com/apache/hudi/pull/5885#issuecomment-1223521172

   Reopen: https://github.com/apache/hudi/pull/6476





[GitHub] [hudi] hudi-bot commented on pull request #6000: [HUDI-4340] fix not parsable text DateTimeParseException in HoodieInstantTimeGenerator.parseDateFromInstantTime

2022-08-22 Thread GitBox


hudi-bot commented on PR #6000:
URL: https://github.com/apache/hudi/pull/6000#issuecomment-1223518699

   
   ## CI report:
   
   * 3782118990698553ac6121b49641e79e01407353 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10802)
 
   * b54e1a1397b1294cc4dc6e28bdfea7fb4ccaceab Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10892)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6000: [HUDI-4340] fix not parsable text DateTimeParseException in HoodieInstantTimeGenerator.parseDateFromInstantTime

2022-08-22 Thread GitBox


hudi-bot commented on PR #6000:
URL: https://github.com/apache/hudi/pull/6000#issuecomment-1223515434

   
   ## CI report:
   
   * 3782118990698553ac6121b49641e79e01407353 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10802)
 
   * b54e1a1397b1294cc4dc6e28bdfea7fb4ccaceab UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hehuiyuan commented on pull request #6392: [HUDI-4618][common]Separate log word for CommitUitls class

2022-08-22 Thread GitBox


hehuiyuan commented on PR #6392:
URL: https://github.com/apache/hudi/pull/6392#issuecomment-1223493994

   @danny0405  hi, take a look when you have time.





[jira] [Assigned] (HUDI-4677) Snapshot view management

2022-08-22 Thread Jian Feng (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Feng reassigned HUDI-4677:
---

Assignee: Jian Feng

> Snapshot view management
> 
>
> Key: HUDI-4677
> URL: https://issues.apache.org/jira/browse/HUDI-4677
> Project: Apache Hudi
>  Issue Type: Epic
>Reporter: Jian Feng
>Assignee: Jian Feng
>Priority: Major
> Attachments: image-2022-08-22-02-03-31-588.png
>
>
>  !image-2022-08-22-02-03-31-588.png! 
> For the snapshot view scenario, Hudi already provides two key features to 
> support it:
> Time travel: the user provides a timestamp to query a specific snapshot view 
> of a Hudi table.
> Savepoint/restore: "savepoint" saves the table as of the commit time so that 
> you can restore the table to this savepoint at a later point in time if need 
> be. In this case, though, users typically use it to prevent the cleaner from 
> removing the snapshot view at a specific timestamp, cleaning only unused 
> files.
> Both are somewhat inconvenient for users when used directly.
> Users usually prefer a meaningful name over querying a Hudi table with a raw 
> timestamp; using the timestamp in SQL may lead to the wrong snapshot view 
> being used. For example, we could announce that a new tag of a Hudi table 
> named table_name_YYYYMMDD was released, and users would then query that 
> table name.
> Savepoint was not designed for this "snapshot view" scenario in the 
> beginning; it was designed for disaster recovery. Say a new snapshot view is 
> created every day with 7-day retention: we should support lifecycle 
> management on top of it.
> What I plan to do is let Hudi support releasing a snapshot view, with 
> lifecycle management, out of the box.
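For reference, a sketch of the raw time-travel read that this proposal would wrap behind a named snapshot (option key is Hudi's as.of.instant; the instant value and path are hypothetical):

{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public final class TimeTravelRead {
  // Today the caller must know the raw instant; the proposal would let a
  // tag like "table_name_20220821" resolve to it and manage its retention.
  static Dataset<Row> snapshotAsOf(SparkSession spark, String basePath) {
    return spark.read().format("hudi")
        .option("as.of.instant", "20220821000000")
        .load(basePath);
  }
}
{code}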





[GitHub] [hudi] dwshmilyss closed issue #6470: [SUPPORT]SHOW PARTITIONS is not allowed on hudi table since its partition metadata is not stored in the Hive metastore

2022-08-22 Thread GitBox


dwshmilyss closed issue #6470: [SUPPORT]SHOW PARTITIONS is not allowed on hudi 
table since its partition metadata is not stored in the Hive metastore
URL: https://github.com/apache/hudi/issues/6470





[GitHub] [hudi] awsUser123 opened a new issue, #6475: [SUPPORT]

2022-08-22 Thread GitBox


awsUser123 opened a new issue, #6475:
URL: https://github.com/apache/hudi/issues/6475

   Hey guys, I am trying to implement reading from Kinesis data streams and 
storing the data into an S3 bucket using Hudi.
   I was able to get the data into S3 by referring to and running the following 
code - 
https://github.com/awsalialem/amazon-kinesis-data-analytics-java-examples/blob/master/S3Sink/src/main/java/com/amazonaws/services/kinesisanalytics/S3StreamingSinkJob.java
   
   I wanted to know how I can store the data into S3 using the Hudi connector 
while reading from Kinesis data streams. Is there any modification I can make 
to the code?
   
   I wrote the following code, but it fails the Flink job -
   
import lombok.NonNull;
import lombok.extern.log4j.Log4j2; // assumed import for the @Log4j2 annotation below
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.JsonNode;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.configuration.Configuration;
import java.util.Arrays;
import java.util.Properties;

/**
 * Entry point into the NZTIlStream Flink application.
 */
@Log4j2
public final class Streaming {
    /**
     * Private constructor.
     */
    private Streaming() {
        throw new UnsupportedOperationException("Creating an instance is not allowed!");
    }

    private static final String REGION = "us-west-2";
    private static final String INPUTSTREAMNAME = "stream_name";
    private static final String S3SINKPATH = "s3://ka-app-bucketname/data";

    // creating kinesis data streams as the source
    private static DataStream<String> createSourceFromStaticConfig(StreamExecutionEnvironment env) {
        Properties inputProperties = new Properties();
        inputProperties.setProperty(ConsumerConfigConstants.AWS_REGION, REGION);
        inputProperties.setProperty(ConsumerConfigConstants.STREAM_INITIAL_POSITION, "LATEST");

        return env.addSource(new FlinkKinesisConsumer<>(INPUTSTREAMNAME,
                new SimpleStringSchema(), inputProperties));
    }

    // creating S3 bucket as the sink
    private static StreamingFileSink<String> createS3SinkFromStaticConfig() {
        final StreamingFileSink<String> sink = StreamingFileSink
                .forRowFormat(new Path(S3SINKPATH), new SimpleStringEncoder<String>("UTF-8"))
                .build();
        return sink;
    }

    /**
     * Main method.
     *
     * @param args the cli args used
     * @throws Exception when the job fails to execute
     */
    public static void main(@NonNull final String[] args) throws Exception {
        // set up the streaming execution environment
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<String> input = createSourceFromStaticConfig(env);

        ObjectMapper jsonParser = new ObjectMapper();

        input.map(value -> { // Parse the JSON
            JsonNode jsonNode = jsonParser.readValue(value, JsonNode.class);
            return new Tuple2<>(jsonNode.get("TICKER").toString(), 1);
        }).returns(Types.TUPLE(Types.STRING, Types.INT))
                .keyBy(0) // Logically partition the stream for each word
                .window(TumblingProcessingTimeWindows.of(Time.minutes(1))) // Flink 1.13
                .sum(1) // Count the appearances by ticker per partition
                .map(value -> value.f0 + " count: " + value.f1.toString() + "\n")
                .addSink(createS3SinkFromStaticConfig());

        env.execute("Flink S3 Streaming Sink Job");
        KinesisHudiSqlExample.createAndDeployJob(env, S3SINKPATH, INPUTSTREAMNAME, REGION);
    }

    public static class KinesisHudiSqlExample {

        public static void createAndDeployJob(StreamExecutionEnviro
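The message is truncated above. The missing piece the question asks about, writing the stream out as a Hudi table instead of a raw row-format file sink, could look roughly like the following Flink SQL sketch (hedged: table name, schema, and path are hypothetical, and the Hudi Flink bundle must be on the classpath):

{code:java}
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public final class KinesisToHudiSketch {
  static void run(StreamExecutionEnvironment env) {
    StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);
    // Declare a Hudi-backed sink table on S3.
    tEnv.executeSql(
        "CREATE TABLE hudi_sink (ticker STRING PRIMARY KEY NOT ENFORCED, cnt INT) WITH ("
            + " 'connector' = 'hudi',"
            + " 'path' = 's3://ka-app-bucketname/hudi',"
            + " 'table.type' = 'COPY_ON_WRITE'"
            + ")");
    // A Kinesis-backed source table could then feed it, e.g.:
    // tEnv.executeSql("INSERT INTO hudi_sink SELECT ticker, cnt FROM kinesis_source");
  }
}
{code}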

[GitHub] [hudi] LinMingQiang commented on issue #5330: [SUPPORT] [BUG] Duplicate fileID ??? from bucket ?? of partition found during the BucketStreamWriteFunction index bootstrap.

2022-08-22 Thread GitBox


LinMingQiang commented on issue #5330:
URL: https://github.com/apache/hudi/issues/5330#issuecomment-1223475113

   see https://github.com/apache/hudi/pull/5763





[GitHub] [hudi] hudi-bot commented on pull request #6467: [HUDI-4686] Flip option 'write.ignore.failed' to default false

2022-08-22 Thread GitBox


hudi-bot commented on PR #6467:
URL: https://github.com/apache/hudi/pull/6467#issuecomment-1223473325

   
   ## CI report:
   
   * e9b3607a806759544aa333ac256cdf95e5434ce3 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10872)
 
   * 23b77552e300ca697b142ebe687cf2a8b4452bfa Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10891)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6467: [HUDI-4686] Flip option 'write.ignore.failed' to default false

2022-08-22 Thread GitBox


hudi-bot commented on PR #6467:
URL: https://github.com/apache/hudi/pull/6467#issuecomment-1223470017

   
   ## CI report:
   
   * e9b3607a806759544aa333ac256cdf95e5434ce3 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10872)
 
   * 23b77552e300ca697b142ebe687cf2a8b4452bfa UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] TengHuo commented on a diff in pull request #6000: [HUDI-4340] fix not parsable text DateTimeParseException in HoodieInstantTimeGenerator.parseDateFromInstantTime

2022-08-22 Thread GitBox


TengHuo commented on code in PR #6000:
URL: https://github.com/apache/hudi/pull/6000#discussion_r952099366


##
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java:
##
@@ -80,6 +80,15 @@ public class HoodieActiveTimeline extends 
HoodieDefaultTimeline {
 
   /**
* Parse the timestamp of an Instant and return a {@code Date}.
+   * Throws ParseException if the timestamp is not in the valid format of
+   *  {@link 
org.apache.hudi.common.table.timeline.HoodieInstantTimeGenerator#SECS_INSTANT_TIMESTAMP_FORMAT}.
+   * This method will mute the ParseException and return a corresponding Date 
if given one of these timestamps:
+   *  {@link 
org.apache.hudi.common.table.timeline.HoodieTimeline#INIT_INSTANT_TS},

Review Comment:
   Got it, let me fix this issue and the same in #3774 in this PR






[jira] [Resolved] (HUDI-4676) infer cleaner policy when write concurrency mode is OCC

2022-08-22 Thread Jian Feng (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Feng resolved HUDI-4676.
-

>  infer cleaner policy when write concurrency mode is OCC
> 
>
> Key: HUDI-4676
> URL: https://issues.apache.org/jira/browse/HUDI-4676
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Jian Feng
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Assigned] (HUDI-4676) infer cleaner policy when write concurrency mode is OCC

2022-08-22 Thread Jian Feng (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Feng reassigned HUDI-4676:
---

Assignee: Jian Feng

>  infer cleaner policy when write concurrency mode is OCC
> 
>
> Key: HUDI-4676
> URL: https://issues.apache.org/jira/browse/HUDI-4676
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Jian Feng
>Assignee: Jian Feng
>Priority: Major
>  Labels: pull-request-available
>






[GitHub] [hudi] dwshmilyss commented on issue #6470: [SUPPORT]SHOW PARTITIONS is not allowed on hudi table since its partition metadata is not stored in the Hive metastore

2022-08-22 Thread GitBox


dwshmilyss commented on issue #6470:
URL: https://github.com/apache/hudi/issues/6470#issuecomment-1223449366

   @Zouxxyy thanks for your advice. I found that this problem is caused by a 
conflict between Spark 3.2 and Hudi 0.11.1: in Hudi 0.11.1, 
HiveSyncTool.getSparkTableProperties() sets spark.sql.sources.provider to 
hudi, and Spark 3.2 does not support SHOW PARTITIONS for tables whose 
provider is hudi.
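A minimal repro sketch of the conflict described above (table name is hypothetical):

{code:java}
import org.apache.spark.sql.SparkSession;

public final class ShowPartitionsRepro {
  static void run(SparkSession spark) {
    // Once the sync tool stamps spark.sql.sources.provider=hudi, Spark 3.2
    // resolves this as a data source table and rejects the command, per the
    // issue title: "SHOW PARTITIONS is not allowed on hudi table since its
    // partition metadata is not stored in the Hive metastore".
    spark.sql("SHOW PARTITIONS default.hudi_tbl").show();
  }
}
{code}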





[jira] [Commented] (HUDI-4384) Hive style partition not work and record key loss prefix using ComplexKey in bulk_insert

2022-08-22 Thread Teng Huo (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-4384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583296#comment-17583296
 ] 

Teng Huo commented on HUDI-4384:


Got it, np. Thanks [~xushiyan]

> Hive style partition not work and record key loss prefix using ComplexKey in 
> bulk_insert
> 
>
> Key: HUDI-4384
> URL: https://issues.apache.org/jira/browse/HUDI-4384
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Reporter: Teng Huo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
> Attachments: Screenshot 2022-07-12 at 16.39.58.png, Screenshot 
> 2022-07-12 at 17.39.15.png
>
>
> When using bulk_insert in 0.11.1, 
> "hoodie.datasource.write.hive_style_partitioning" won't work
> When using "org.apache.hudi.keygen.ComplexKeyGenerator", there is no prefix 
> in column "_hoodie_record_key"
> There is a GitHub issue reported: https://github.com/apache/hudi/issues/6070
> And we can reproduce this bug with code
> {code:java}
>   def main(args: Array[String]): Unit = {
> val avroSchema = new Schema.Parser().parse(new 
> File("~/hudi/docker/demo/config/schema.avsc"))
> val schema = 
> SchemaConverters.toSqlType(avroSchema).dataType.asInstanceOf[StructType]
> val df = 
> spark.read.schema(schema).json("file://~/hudi/docker/demo/data/batch_1.json")
> val options = Map (
>   "hoodie.datasource.write.keygenerator.class" -> 
> "org.apache.hudi.keygen.ComplexKeyGenerator",
>   "hoodie.bulkinsert.sort.mode" -> "GLOBAL_SORT",
>   "hoodie.datasource.write.table.type" -> "COPY_ON_WRITE",
>   "hoodie.datasource.write.precombine.field" -> "ts",
>   "hoodie.datasource.write.recordkey.field" -> "key",
>   "hoodie.datasource.write.partitionpath.field" -> "year",
>   "hoodie.datasource.write.hive_style_partitioning" -> "true",
>   "hoodie.datasource.hive_sync.enable" -> "false",
>   "hoodie.datasource.hive_sync.partition_fields" -> "year",
>   "hoodie.datasource.hive_sync.partition_extractor_class" -> 
> "org.apache.hudi.hive.MultiPartKeysValueExtractor"
> )
> bulkInsert(df, options)
> insert(df, options)
>   }
>   def bulkInsert(df: DataFrame, options: Map[String, String]): Unit = {
> val allOptions: Map[String, String] = options ++ Map (
>   "hoodie.datasource.write.operation" -> "bulk_insert",
>   "hoodie.table.name" -> "test_hudi_bulk_table"
> )
> df.write.format("hudi")
>   .options(allOptions)
>   .mode(SaveMode.Overwrite)
>   .save("file://~/test_hudi_bulk_table")
>   }
>   def insert(df: DataFrame, options: Map[String, String]): Unit = {
> val allOptions: Map[String, String] = options ++ Map (
>   "hoodie.datasource.write.operation" -> "insert",
>   "hoodie.table.name" -> "test_hudi_insert_table"
> )
> df.write.format("hudi")
>   .options(allOptions)
>   .mode(SaveMode.Overwrite)
>   .save("file://~/test_hudi_insert_table")
>   }
> {code}
> The data written in file showing as below
>  !Screenshot 2022-07-12 at 16.39.58.png! 





[GitHub] [hudi] danny0405 commented on a diff in pull request #6000: [HUDI-4340] fix not parsable text DateTimeParseException in HoodieInstantTimeGenerator.parseDateFromInstantTime

2022-08-22 Thread GitBox


danny0405 commented on code in PR #6000:
URL: https://github.com/apache/hudi/pull/6000#discussion_r952087908


##
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java:
##
@@ -80,6 +80,15 @@ public class HoodieActiveTimeline extends 
HoodieDefaultTimeline {
 
   /**
* Parse the timestamp of an Instant and return a {@code Date}.
+   * Throws ParseException if the timestamp is not in the valid format of
+   *  {@link 
org.apache.hudi.common.table.timeline.HoodieInstantTimeGenerator#SECS_INSTANT_TIMESTAMP_FORMAT}.
+   * This method will mute the ParseException and return a corresponding Date 
if given one of these timestamps:
+   *  {@link 
org.apache.hudi.common.table.timeline.HoodieTimeline#INIT_INSTANT_TS},

Review Comment:
   +1, that's my initial idea.






[GitHub] [hudi] danny0405 commented on pull request #6456: [HUDI-4674]Change the default value of inputFormat for the MOR table

2022-08-22 Thread GitBox


danny0405 commented on PR #6456:
URL: https://github.com/apache/hudi/pull/6456#issuecomment-1223440395

   @alexeykudinkin Can you help take a look here ?





[GitHub] [hudi] danny0405 commented on pull request #6456: [HUDI-4674]Change the default value of inputFormat for the MOR table

2022-08-22 Thread GitBox


danny0405 commented on PR #6456:
URL: https://github.com/apache/hudi/pull/6456#issuecomment-1223438615

   > Spark SQL. then we'll see the table inputFormat is 
HoodieParquetRealtimeInputFormat
   
   Thanks, we may need to figure out why Spark SQL uses 
`HoodieParquetRealtimeInputFormat` as the default instead; I guess it tries to 
expose the full snapshot query view by default.





[jira] [Updated] (HUDI-4683) Use enum class value for default value in flink options

2022-08-22 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-4683:
-
Summary: Use enum class value for default value in flink options  (was: Fix 
to use enum class value for default value in flinkoptions)

> Use enum class value for default value in flink options
> ---
>
> Key: HUDI-4683
> URL: https://issues.apache.org/jira/browse/HUDI-4683
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: hehuiyuan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> ### Change Logs
> 1. Use Enum Class instead of constant string for option default value in 
> FlinkOptions.
> 2. fix doc error for default value
> !https://user-images.githubusercontent.com/18002496/185736834-81825dc6-9c4f-4633-8783-fb52239ce6a2.png|width=692,height=116!
>  





[jira] [Updated] (HUDI-4683) Fix to use enum class value for default value in flinkoptions

2022-08-22 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-4683:
-
Fix Version/s: 0.12.1

> Fix to use enum class value for default value in flinkoptions
> -
>
> Key: HUDI-4683
> URL: https://issues.apache.org/jira/browse/HUDI-4683
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: hehuiyuan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> ### Change Logs
> 1. Use Enum Class instead of constant string for option default value in 
> FlinkOptions.
> 2. fix doc error for default value
> !https://user-images.githubusercontent.com/18002496/185736834-81825dc6-9c4f-4633-8783-fb52239ce6a2.png|width=692,height=116!
>  





[jira] [Resolved] (HUDI-4683) Fix to use enum class value for default value in flinkoptions

2022-08-22 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen resolved HUDI-4683.
--

> Fix to use enum class value for default value in flinkoptions
> -
>
> Key: HUDI-4683
> URL: https://issues.apache.org/jira/browse/HUDI-4683
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: hehuiyuan
>Priority: Major
>  Labels: pull-request-available
>
> ### Change Logs
> 1. Use Enum Class instead of constant string for option default value in 
> FlinkOptions.
> 2. fix doc error for default value
> !https://user-images.githubusercontent.com/18002496/185736834-81825dc6-9c4f-4633-8783-fb52239ce6a2.png|width=692,height=116!
>  





[jira] [Commented] (HUDI-4683) Fix to use enum class value for default value in flinkoptions

2022-08-22 Thread Danny Chen (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-4683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583294#comment-17583294
 ] 

Danny Chen commented on HUDI-4683:
--

Fixed via master branch: c677333f26aaa4dc880a04b7532929b68bd978ed

> Fix to use enum class value for default value in flinkoptions
> -
>
> Key: HUDI-4683
> URL: https://issues.apache.org/jira/browse/HUDI-4683
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: hehuiyuan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> ### Change Logs
> 1. Use Enum Class instead of constant string for option default value in 
> FlinkOptions.
> 2. fix doc error for default value
> !https://user-images.githubusercontent.com/18002496/185736834-81825dc6-9c4f-4633-8783-fb52239ce6a2.png|width=692,height=116!
>  





[hudi] branch master updated (4966978a55 -> c677333f26)

2022-08-22 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from 4966978a55 [HUDI-4676] infer cleaner policy when write concurrency 
mode is OCC (#6459)
 add c677333f26 [HUDI-4683] Use enum class value for default value in flink 
options (#6453)

No new revisions were added by this update.

Summary of changes:
 .../java/org/apache/hudi/configuration/FlinkOptions.java | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)



[GitHub] [hudi] danny0405 merged pull request #6453: [HUDI-4683] Use enum class value for default value in flink options

2022-08-22 Thread GitBox


danny0405 merged PR #6453:
URL: https://github.com/apache/hudi/pull/6453





[GitHub] [hudi] xushiyan commented on pull request #4665: [HUDI-2733] Add support for Thrift sync

2022-08-22 Thread GitBox


xushiyan commented on PR #4665:
URL: https://github.com/apache/hudi/pull/4665#issuecomment-1223435214

   @stym06 any chance you can rebase and update this PR?





[jira] [Updated] (HUDI-4261) OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of partitions

2022-08-22 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4261:
-
Sprint: 2022/08/22

> OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of 
> partitions
> -
>
> Key: HUDI-4261
> URL: https://issues.apache.org/jira/browse/HUDI-4261
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.0
>
> Attachments: Screen Shot 2022-06-15 at 6.06.06 PM.png
>
>
> While experimenting w/ bulk-inserting, I've stumbled upon an OOM failure when 
> you do bulk-insert w/ sort-mode "NONE" for a table w/ a large number of 
> partitions (> 1000).
>  
> This happens for the same reasons as HUDI-3883: every logical partition 
> (let's say we have N of these, equal to the shuffling-parallelism in Hudi) 
> handled by Spark (since no re-partitioning is done to align with the actual 
> partition-column) will likely have a record from every physical partition on 
> disk (let's say we have M of these). B/c of that, every logical partition will 
> be writing into every physical one.
> This will eventually produce:
>  # M * N files in the table
>  # For every file in the table being written, Hudi will keep a "handle" in 
> memory, which in turn will hold a full buffer's worth of Parquet data (until 
> flushed).
> This ultimately leads to an OOM.
>  
> !Screen Shot 2022-06-15 at 6.06.06 PM.png!
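
For illustration, a minimal sketch of the mitigation (assumed paths, table name, parallelism, and options; not the ticket's actual fix): choosing a sort mode that orders records by partition path lets each write task finish one physical partition's file before opening the next handle, keeping the number of simultaneously open handles, and their Parquet buffers, bounded.

{code:scala}
// `spark` is an existing SparkSession; source/target paths and configs are assumed.
spark.read.parquet("s3://bucket/source")
  .write.format("hudi")
  .option("hoodie.table.name", "example_table")
  .option("hoodie.datasource.write.operation", "bulk_insert")
  .option("hoodie.bulkinsert.shuffle.parallelism", "200")   // N logical partitions
  // PARTITION_SORT sorts each task's records by partition path, so handles can
  // be closed as the task moves on; "NONE" leaves up to M handles open per task.
  .option("hoodie.bulkinsert.sort.mode", "PARTITION_SORT")
  .save("s3://bucket/target")
{code}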



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #6170: [HUDI-4441] Log4j2 configuration fixes and removal of log4j1 dependencies

2022-08-22 Thread GitBox


hudi-bot commented on PR #6170:
URL: https://github.com/apache/hudi/pull/6170#issuecomment-1223422846

   
   ## CI report:
   
   * 520b1b54c37ce6378047a55f650e309d6feb89d1 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10885)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Closed] (HUDI-3806) Improve HoodieBloomIndex using bloom_filter and col_stats in MDT

2022-08-22 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu closed HUDI-3806.

Resolution: Duplicate

> Improve HoodieBloomIndex using bloom_filter and col_stats in MDT
> 
>
> Key: HUDI-3806
> URL: https://issues.apache.org/jira/browse/HUDI-3806
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.13.0
>
>
> For a Deltastreamer job doing bulk inserts of 10GB batches, the job is stuck 
> at the stage where HoodieBloomIndex reads the bloom filter from the metadata 
> table, taking more than 2 hours. When the bloom filter is disabled in the 
> metadata table, each commit takes 10-20 minutes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4585) Optimize query performance on Presto Hudi connector

2022-08-22 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4585:
-
Story Points: 10

> Optimize  query performance on Presto Hudi connector
> 
>
> Key: HUDI-4585
> URL: https://issues.apache.org/jira/browse/HUDI-4585
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4586) Address S3 timeouts in Bloom Index with metadata table

2022-08-22 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4586:
-
Story Points: 5

> Address S3 timeouts in Bloom Index with metadata table
> --
>
> Key: HUDI-4586
> URL: https://issues.apache.org/jira/browse/HUDI-4586
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.12.1
>
> Attachments: Screen Shot 2022-08-15 at 17.39.01.png
>
>
> For a partitioned table, a significant number of S3 requests time out, 
> causing the upserts to fail when using the Bloom Index with the metadata table.
> {code:java}
> Load meta index key ranges for file slices: hudi
> collect at HoodieSparkEngineContext.java:137+details
> org.apache.spark.api.java.AbstractJavaRDDLike.collect(JavaRDDLike.scala:45)
> org.apache.hudi.client.common.HoodieSparkEngineContext.flatMap(HoodieSparkEngineContext.java:137)
> org.apache.hudi.index.bloom.HoodieBloomIndex.loadColumnRangesFromMetaIndex(HoodieBloomIndex.java:213)
> org.apache.hudi.index.bloom.HoodieBloomIndex.getBloomIndexFileInfoForPartitions(HoodieBloomIndex.java:145)
> org.apache.hudi.index.bloom.HoodieBloomIndex.lookupIndex(HoodieBloomIndex.java:123)
> org.apache.hudi.index.bloom.HoodieBloomIndex.tagLocation(HoodieBloomIndex.java:89)
> org.apache.hudi.table.action.commit.HoodieWriteHelper.tag(HoodieWriteHelper.java:49)
> org.apache.hudi.table.action.commit.HoodieWriteHelper.tag(HoodieWriteHelper.java:32)
> org.apache.hudi.table.action.commit.BaseWriteHelper.write(BaseWriteHelper.java:53)
> org.apache.hudi.table.action.commit.SparkUpsertCommitActionExecutor.execute(SparkUpsertCommitActionExecutor.java:45)
> org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:113)
> org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:97)
> org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:155)
> org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:206)
> org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:329)
> org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:183)
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
>  {code}
> {code:java}
> org.apache.hudi.exception.HoodieException: Exception when reading log file 
>     at 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternal(AbstractHoodieLogRecordReader.java:352)
>     at 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scan(AbstractHoodieLogRecordReader.java:196)
>     at 
> org.apache.hudi.metadata.HoodieMetadataMergedLogRecordReader.getRecordsByKeys(HoodieMetadataMergedLogRecordReader.java:124)
>     at 
> org.apache.hudi.metadata.HoodieBackedTableMetadata.readLogRecords(HoodieBackedTableMetadata.java:266)
>     at 
> org.apache.hudi.metadata.HoodieBackedTableMetadata.lambda$getRecordsByKeys$1(HoodieBackedTableMetadata.java:222)
>     at java.util.HashMap.forEach(HashMap.java:1290)
>     at 
> org.apache.hudi.metadata.HoodieBackedTableMetadata.getRecordsByKeys(HoodieBackedTableMetadata.java:209)
>     at 
> org.apache.hudi.metadata.BaseTableMetadata.getColumnStats(BaseTableMetadata.java:253)
>     at 
> org.apache.hudi.index.bloom.HoodieBloomIndex.lambda$loadColumnRangesFromMetaIndex$cc8e7ca2$1(HoodieBloomIndex.java:224)
>     at 
> org.apache.hudi.client.common.HoodieSparkEngineContext.lambda$flatMap$7d470b86$1(HoodieSparkEngineContext.java:137)
>     at 
> org.apache.spark.api.java.JavaRDDLike.$anonfun$flatMap$1(JavaRDDLike.scala:125)
>     at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
>     at scala.collection.Iterator.foreach(Iterator.scala:943)
>     at scala.collection.Iterator.foreach$(Iterator.scala:943)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>     at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
>     at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
>     at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
>     at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
>     at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
>     at scala.collection.TraversableOnce.to$(Trav

[jira] [Updated] (HUDI-3806) Improve HoodieBloomIndex using bloom_filter and col_stats in MDT

2022-08-22 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3806:
-
Story Points: 0  (was: 4)

> Improve HoodieBloomIndex using bloom_filter and col_stats in MDT
> 
>
> Key: HUDI-3806
> URL: https://issues.apache.org/jira/browse/HUDI-3806
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.13.0
>
>
> For a Deltastreamer job doing bulk inserts of 10GB batches, the job is stuck 
> at the stage where HoodieBloomIndex reads the bloom filter from the metadata 
> table, taking more than 2 hours. When the bloom filter is disabled in the 
> metadata table, each commit takes 10-20 minutes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4586) Address S3 timeouts in Bloom Index with metadata table

2022-08-22 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4586:
-
Status: Patch Available  (was: In Progress)

> Address S3 timeouts in Bloom Index with metadata table
> --
>
> Key: HUDI-4586
> URL: https://issues.apache.org/jira/browse/HUDI-4586
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.12.1
>
> Attachments: Screen Shot 2022-08-15 at 17.39.01.png
>
>
> For a partitioned table, a significant number of S3 requests time out, 
> causing the upserts to fail when using the Bloom Index with the metadata table.
> {code:java}
> Load meta index key ranges for file slices: hudi
> collect at HoodieSparkEngineContext.java:137+details
> org.apache.spark.api.java.AbstractJavaRDDLike.collect(JavaRDDLike.scala:45)
> org.apache.hudi.client.common.HoodieSparkEngineContext.flatMap(HoodieSparkEngineContext.java:137)
> org.apache.hudi.index.bloom.HoodieBloomIndex.loadColumnRangesFromMetaIndex(HoodieBloomIndex.java:213)
> org.apache.hudi.index.bloom.HoodieBloomIndex.getBloomIndexFileInfoForPartitions(HoodieBloomIndex.java:145)
> org.apache.hudi.index.bloom.HoodieBloomIndex.lookupIndex(HoodieBloomIndex.java:123)
> org.apache.hudi.index.bloom.HoodieBloomIndex.tagLocation(HoodieBloomIndex.java:89)
> org.apache.hudi.table.action.commit.HoodieWriteHelper.tag(HoodieWriteHelper.java:49)
> org.apache.hudi.table.action.commit.HoodieWriteHelper.tag(HoodieWriteHelper.java:32)
> org.apache.hudi.table.action.commit.BaseWriteHelper.write(BaseWriteHelper.java:53)
> org.apache.hudi.table.action.commit.SparkUpsertCommitActionExecutor.execute(SparkUpsertCommitActionExecutor.java:45)
> org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:113)
> org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:97)
> org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:155)
> org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:206)
> org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:329)
> org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:183)
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
>  {code}
> {code:java}
> org.apache.hudi.exception.HoodieException: Exception when reading log file 
>     at 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternal(AbstractHoodieLogRecordReader.java:352)
>     at 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scan(AbstractHoodieLogRecordReader.java:196)
>     at 
> org.apache.hudi.metadata.HoodieMetadataMergedLogRecordReader.getRecordsByKeys(HoodieMetadataMergedLogRecordReader.java:124)
>     at 
> org.apache.hudi.metadata.HoodieBackedTableMetadata.readLogRecords(HoodieBackedTableMetadata.java:266)
>     at 
> org.apache.hudi.metadata.HoodieBackedTableMetadata.lambda$getRecordsByKeys$1(HoodieBackedTableMetadata.java:222)
>     at java.util.HashMap.forEach(HashMap.java:1290)
>     at 
> org.apache.hudi.metadata.HoodieBackedTableMetadata.getRecordsByKeys(HoodieBackedTableMetadata.java:209)
>     at 
> org.apache.hudi.metadata.BaseTableMetadata.getColumnStats(BaseTableMetadata.java:253)
>     at 
> org.apache.hudi.index.bloom.HoodieBloomIndex.lambda$loadColumnRangesFromMetaIndex$cc8e7ca2$1(HoodieBloomIndex.java:224)
>     at 
> org.apache.hudi.client.common.HoodieSparkEngineContext.lambda$flatMap$7d470b86$1(HoodieSparkEngineContext.java:137)
>     at 
> org.apache.spark.api.java.JavaRDDLike.$anonfun$flatMap$1(JavaRDDLike.scala:125)
>     at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
>     at scala.collection.Iterator.foreach(Iterator.scala:943)
>     at scala.collection.Iterator.foreach$(Iterator.scala:943)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>     at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
>     at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
>     at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
>     at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
>     at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
>     at scala.collect

[jira] [Assigned] (HUDI-1369) Bootstrap datasource jobs from hanging via spark-submit

2022-08-22 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu reassigned HUDI-1369:


Assignee: Ethan Guo  (was: Wenning Ding)

> Bootstrap datasource jobs from hanging via spark-submit
> ---
>
> Key: HUDI-1369
> URL: https://issues.apache.org/jira/browse/HUDI-1369
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Wenning Ding
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.13.0
>
>
> MOR table creation via Hudi datasource hangs at the end of the spark-submit 
> job.
> Looks like {{HoodieWriteClient}} at 
> [https://github.com/apache/hudi/blob/release-0.6.0/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala#L255]
>  is not being closed, which means the timeline server is not stopped at the end; as 
> a result the job hangs and never exits.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4125) Add IT (Azure CI) around bootstrapped Hudi table

2022-08-22 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4125:
-
Summary: Add IT (Azure CI) around bootstrapped Hudi table  (was: Add 
integration tests around bootstrapped Hudi table)

> Add IT (Azure CI) around bootstrapped Hudi table
> 
>
> Key: HUDI-4125
> URL: https://issues.apache.org/jira/browse/HUDI-4125
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Ethan Guo
>Priority: Blocker
> Fix For: 0.13.0
>
>
> For a bootstrapped Hudi table with the bootstrap format, the table should be queryable 
> through different engines without any issue.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-3495) Reading keys in parallel from HoodieMetadataMergedLogRecordReader may lead to empty results even if key exists

2022-08-22 Thread Raymond Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583290#comment-17583290
 ] 

Raymond Xu commented on HUDI-3495:
--

[~guoyihua]: to be verified before closing

> Reading keys in parallel from HoodieMetadataMergedLogRecordReader may lead to 
> empty results even if key exists
> --
>
> Key: HUDI-3495
> URL: https://issues.apache.org/jira/browse/HUDI-3495
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: Prashant Wason
>Assignee: Yue Zhang
>Priority: Blocker
> Fix For: 0.12.1
>
>
> [HoodieMetadataMergedLogRecordReader|https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataMergedLogRecordReader.java]
>  has two functions which look up keys:
> getRecordByKey(String key) - looks up the key in the member variable map "records"
> getRecordsByKeys(List keys) - clears the member variable map "records" 
> and scans the log files again.
> If the two functions are called in parallel, getRecordByKey() may return 
> an empty result because the "records" map was cleared by another thread calling 
> getRecordsByKeys().
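
For illustration, a minimal sketch (hypothetical names, not the actual Hudi classes) of the stateless alternative: each call builds a local result map instead of mutating a shared "records" field, so a parallel getRecordsByKeys() cannot clear state another thread is still reading.

{code:scala}
// `scanLogFiles` stands in for the log/base-file scan; it fills the map it is given.
class StatelessLogRecordReader[R](
    scanLogFiles: (Seq[String], scala.collection.mutable.Map[String, R]) => Unit) {

  def getRecordsByKeys(keys: Seq[String]): Map[String, R] = {
    val local = scala.collection.mutable.Map.empty[String, R] // per-call state, never shared
    scanLogFiles(keys, local)
    local.toMap
  }

  def getRecordByKey(key: String): Option[R] =
    getRecordsByKeys(Seq(key)).get(key)                       // safe under concurrency
}
{code}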



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3495) Reading keys in parallel from HoodieMetadataMergedLogRecordReader may lead to empty results even if key exists

2022-08-22 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3495:
-
Sprint:   (was: 2022/09/19)

> Reading keys in parallel from HoodieMetadataMergedLogRecordReader may lead to 
> empty results even if key exists
> --
>
> Key: HUDI-3495
> URL: https://issues.apache.org/jira/browse/HUDI-3495
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: Prashant Wason
>Assignee: Yue Zhang
>Priority: Blocker
> Fix For: 0.12.1
>
>
> [HoodieMetadataMergedLogRecordReader|https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataMergedLogRecordReader.java]
>  has two functions which look up keys:
> getRecordByKey(String key) - looks up the key in the member variable map "records"
> getRecordsByKeys(List keys) - clears the member variable map "records" 
> and scans the log files again.
> If the two functions are called in parallel, getRecordByKey() may return 
> an empty result because the "records" map was cleared by another thread calling 
> getRecordsByKeys().



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3777) Optimize column stats storage

2022-08-22 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3777:
-
Sprint:   (was: 2022/09/19)

> Optimize column stats storage
> -
>
> Key: HUDI-3777
> URL: https://issues.apache.org/jira/browse/HUDI-3777
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Priority: Blocker
> Fix For: 0.13.0
>
>
> Avoid storing the filename of each record in the colstats partition.
> As of now, we store fileName as part of the value in col stats entries. This 
> results in more storage, but comes w/ the ease of getting everything in 1 look 
> up. But as you could see, the file name is repeated in every entry's value. And 
> since it's UUID-based, each file name is going to add ~70 bytes to each entry. 
> For example, 
> let's say we have a table with 1000 columns and 1000 partitions, with each 
> partition having 10k files. 
> Total entries in the col stats partition = 1000 * 1000 * 10,000 = 10^10, i.e. 10B records. 
> So that's ~700 GB of repeated file names. 
> Whereas if we can come up with a mapping of a unique id for every filename, 
> and store the mapping elsewhere (like the FILES partition), we need only 8 bytes 
> per entry. 
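
A back-of-envelope check of the figures above (a sketch using the ticket's assumed table shape):

{code:scala}
val columns           = 1000L
val partitions        = 1000L
val filesPerPartition = 10000L

val entries          = columns * partitions * filesPerPartition // 10^10 entries (10B)
val fileNameOverhead = entries * 70L                            // ~7 * 10^11 bytes, i.e. ~700 GB
val fileIdOverhead   = entries * 8L                             // ~8 * 10^10 bytes, i.e. ~80 GB
{code}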



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3777) Optimize column stats storage

2022-08-22 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3777:
-
Description: 
Avoid storing the filename of each record in the colstats partition.


As of now, we store fileName as part of the value in col stats entries. This 
results in more storage, but comes w/ the ease of getting everything in 1 look up. 
But as you could see, the file name is repeated in every entry's value. And since 
it's UUID-based, each file name is going to add ~70 bytes to each entry. 

For example, 
let's say we have a table with 1000 columns and 1000 partitions, with each 
partition having 10k files. 

Total entries in the col stats partition = 1000 * 1000 * 10,000 = 10^10, i.e. 10B records. 
So that's ~700 GB of repeated file names. 

Whereas if we can come up with a mapping of a unique id for every filename, 
and store the mapping elsewhere (like the FILES partition), we need only 8 bytes 
per entry. 

  was:Avoid storing filename of each record in the colstats partition.


> Optimize column stats storage
> -
>
> Key: HUDI-3777
> URL: https://issues.apache.org/jira/browse/HUDI-3777
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Priority: Blocker
> Fix For: 0.13.0
>
>
> Avoid storing the filename of each record in the colstats partition.
> As of now, we store fileName as part of the value in col stats entries. This 
> results in more storage, but comes w/ the ease of getting everything in 1 look 
> up. But as you could see, the file name is repeated in every entry's value. And 
> since it's UUID-based, each file name is going to add ~70 bytes to each entry. 
> For example, 
> let's say we have a table with 1000 columns and 1000 partitions, with each 
> partition having 10k files. 
> Total entries in the col stats partition = 1000 * 1000 * 10,000 = 10^10, i.e. 10B records. 
> So that's ~700 GB of repeated file names. 
> Whereas if we can come up with a mapping of a unique id for every filename, 
> and store the mapping elsewhere (like the FILES partition), we need only 8 bytes 
> per entry. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4669) Incorrect protoc executable in kafka-connect fails build on Mac M1

2022-08-22 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4669:
-
Sprint: 2022/09/19

> Incorrect protoc executable in kafka-connect fails build on Mac M1 
> ---
>
> Key: HUDI-4669
> URL: https://issues.apache.org/jira/browse/HUDI-4669
> Project: Apache Hudi
>  Issue Type: Task
>  Components: connectors
>Reporter: Suneel Marthi
>Assignee: Suneel Marthi
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Executing the Hudi build on Arm64 (MacBook M1 Pro/Max) fails when building the 
> hudi-kafka-connect module:
>  
> {code:java}
> [*INFO*] 
> 
> [*ERROR*] Failed to execute goal 
> com.github.os72:protoc-jar-maven-plugin:3.11.4:run (default) on project 
> hudi-kafka-connect: Error extracting protoc for version 3.11.4: Unsupported 
> platform: protoc-3.11.4-osx-aarch_64.exe -> [Help 1] {code}
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[hudi] branch master updated (3adb571531 -> 4966978a55)

2022-08-22 Thread forwardxu
This is an automated email from the ASF dual-hosted git repository.

forwardxu pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from 3adb571531 [HUDI-4678] Claim RFC-61 for Snapshot view management 
(#6461)
 add 4966978a55 [HUDI-4676] infer cleaner policy when write concurrency 
mode is OCC (#6459)

No new revisions were added by this update.

Summary of changes:
 .../main/java/org/apache/hudi/config/HoodieCleanConfig.java| 10 ++
 1 file changed, 10 insertions(+)
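
For context, a hedged sketch of what "infer cleaner policy" means here (assumed names and string constants; the actual change lives in HoodieCleanConfig): when the write concurrency mode is OCC and the user has not set a failed-writes cleaning policy, default it to LAZY so one writer's cleaner does not delete another writer's inflight files.

```scala
// Sketch only: infer the failed-writes cleaning policy from the concurrency mode.
def inferFailedWritesCleaningPolicy(concurrencyMode: String,
                                    userSetPolicy: Option[String]): String =
  userSetPolicy.getOrElse(
    if (concurrencyMode == "OPTIMISTIC_CONCURRENCY_CONTROL") "LAZY" else "EAGER")
```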



[GitHub] [hudi] XuQianJin-Stars merged pull request #6459: [HUDI-4676] infer cleaner policy when write concurrency mode is OCC

2022-08-22 Thread GitBox


XuQianJin-Stars merged PR #6459:
URL: https://github.com/apache/hudi/pull/6459


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-3519) Make sure every public Hudi Client Method invokes necessary prologue

2022-08-22 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3519:
-
Component/s: code-quality

> Make sure every public Hudi Client Method invokes necessary prologue
> 
>
> Key: HUDI-3519
> URL: https://issues.apache.org/jira/browse/HUDI-3519
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: code-quality, metadata
>Reporter: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.12.1
>
>
> Right now, only a handful of operations actually invoke the "prologue" method 
> doing, for example:
>  # Checks around whether the table needs to be upgraded
>  # Bootstraps MDT (if necessary)
> as well as some other minor book-keeping stuff. As part of 
> [https://github.com/apache/hudi/pull/4739,] I had to address that and 
> introduced the universal method `initTable` that serves as such a prologue.
> However, while I've injected it into most major public methods of the Hudi 
> Client's base class, we need to carefully and holistically review all 
> remaining exposed *public* methods and make sure that all _public-facing_ 
> operations (insert, upsert, commit, delete, rollback, clean, etc.) are 
> invoking the prologue properly.
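
For illustration, a minimal sketch of the "prologue" pattern being asked for (hypothetical class and method names; the real prologue is `initTable` in the Hudi client):

{code:scala}
abstract class BaseClient {
  // The prologue: upgrade the table version if needed, bootstrap the MDT, etc.
  protected def initTable(operation: String): Unit

  // Every public entry point routes through the prologue before doing work.
  final def upsert(records: Seq[AnyRef]): Unit = { initTable("upsert"); doUpsert(records) }
  final def delete(keys: Seq[String]): Unit    = { initTable("delete"); doDelete(keys) }

  protected def doUpsert(records: Seq[AnyRef]): Unit
  protected def doDelete(keys: Seq[String]): Unit
}
{code}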



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3519) Make sure every public Hudi Client Method invokes necessary prologue

2022-08-22 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3519:
-
Priority: Major  (was: Blocker)

> Make sure every public Hudi Client Method invokes necessary prologue
> 
>
> Key: HUDI-3519
> URL: https://issues.apache.org/jira/browse/HUDI-3519
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: code-quality, metadata
>Reporter: Alexey Kudinkin
>Priority: Major
> Fix For: 0.12.1
>
>
> Right now, only a handful of operations actually invoke the "prologue" method 
> doing, for example:
>  # Checks around whether the table needs to be upgraded
>  # Bootstraps MDT (if necessary)
> as well as some other minor book-keeping stuff. As part of 
> [https://github.com/apache/hudi/pull/4739,] I had to address that and 
> introduced the universal method `initTable` that serves as such a prologue.
> However, while I've injected it into most major public methods of the Hudi 
> Client's base class, we need to carefully and holistically review all 
> remaining exposed *public* methods and make sure that all _public-facing_ 
> operations (insert, upsert, commit, delete, rollback, clean, etc.) are 
> invoking the prologue properly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3519) Make sure every public Hudi Client Method invokes necessary prologue

2022-08-22 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3519:
-
Sprint:   (was: 2022/09/19)

> Make sure every public Hudi Client Method invokes necessary prologue
> 
>
> Key: HUDI-3519
> URL: https://issues.apache.org/jira/browse/HUDI-3519
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: code-quality, metadata
>Reporter: Alexey Kudinkin
>Priority: Major
> Fix For: 0.12.1
>
>
> Right now, only a handful of operations actually invoke the "prologue" method 
> doing, for example:
>  # Checks around whether the table needs to be upgraded
>  # Bootstraps MDT (if necessary)
> as well as some other minor book-keeping stuff. As part of 
> [https://github.com/apache/hudi/pull/4739,] I had to address that and 
> introduced the universal method `initTable` that serves as such a prologue.
> However, while I've injected it into most major public methods of the Hudi 
> Client's base class, we need to carefully and holistically review all 
> remaining exposed *public* methods and make sure that all _public-facing_ 
> operations (insert, upsert, commit, delete, rollback, clean, etc.) are 
> invoking the prologue properly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3301) MergedLogRecordReader inline reading should be stateless and thread safe

2022-08-22 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3301:
-
Sprint:   (was: 2022/09/19)

> MergedLogRecordReader inline reading should be stateless and thread safe
> 
>
> Key: HUDI-3301
> URL: https://issues.apache.org/jira/browse/HUDI-3301
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: Manoj Govindassamy
>Assignee: Yue Zhang
>Priority: Major
> Fix For: 0.12.1
>
>
> Metadata table inline reading (enable.full.scan.log.files = false) today 
> alters instance member fields and is not thread-safe.
>  
> When inline reading is enabled, HoodieMetadataMergedLogRecordReader 
> doesn't do a full read of log and base files and doesn't fill in the 
> ExternalSpillableMap records cache. Each getRecordsByKeys() thereby will 
> re-read the log and base files by design. But the issue here is that this reading 
> alters the instance members, and the filled-in records are relevant only for 
> that request. Any concurrent getRecordsByKeys() is also modifying the member 
> variables, leading to NPEs.
>  
> To avoid this, a temporary fix of making getRecordsByKeys() a synchronized 
> method has been pushed to master. But this fix doesn't solve all use cases. We 
> need to make the whole class stateless and thread-safe for inline reading.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3301) MergedLogRecordReader inline reading should be stateless and thread safe

2022-08-22 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3301:
-
Priority: Major  (was: Blocker)

> MergedLogRecordReader inline reading should be stateless and thread safe
> 
>
> Key: HUDI-3301
> URL: https://issues.apache.org/jira/browse/HUDI-3301
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: Manoj Govindassamy
>Assignee: Yue Zhang
>Priority: Major
> Fix For: 0.12.1
>
>
> Metadata table inline reading (enable.full.scan.log.files = false) today 
> alters instance member fields and is not thread-safe.
>  
> When inline reading is enabled, HoodieMetadataMergedLogRecordReader 
> doesn't do a full read of log and base files and doesn't fill in the 
> ExternalSpillableMap records cache. Each getRecordsByKeys() thereby will 
> re-read the log and base files by design. But the issue here is that this reading 
> alters the instance members, and the filled-in records are relevant only for 
> that request. Any concurrent getRecordsByKeys() is also modifying the member 
> variables, leading to NPEs.
>  
> To avoid this, a temporary fix of making getRecordsByKeys() a synchronized 
> method has been pushed to master. But this fix doesn't solve all use cases. We 
> need to make the whole class stateless and thread-safe for inline reading.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-3300) Timeline server FSViewManager should avoid point lookup for metadata file partition

2022-08-22 Thread Raymond Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583284#comment-17583284
 ] 

Raymond Xu commented on HUDI-3300:
--

[~guoyihua] : can be closed after verification

> Timeline server FSViewManager should avoid point lookup for metadata file 
> partition
> ---
>
> Key: HUDI-3300
> URL: https://issues.apache.org/jira/browse/HUDI-3300
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata, timeline-server
>Reporter: Manoj Govindassamy
>Assignee: Yue Zhang
>Priority: Major
> Fix For: 0.12.1
>
>
> When inline reading is enabled, that is 
> hoodie.metadata.enable.full.scan.log.files = false, 
> MetadataMergedLogRecordReader doesn't cache the file listing records via the 
> ExternalSpillableMap. So, every file listing will lead to re-reading of the 
> metadata files partition's log and base files. Since the files partition is small, 
> even when inline reading is enabled, the TimelineServer should 
> construct the FSViewManager with inline reading disabled for the metadata files 
> partition. 
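
For illustration, a sketch (assumed names) of the special-casing being proposed: keep point lookups for the large metadata partitions, but always fully scan, and therefore cache, the small files partition.

{code:scala}
// Returns whether full-scan (cached) reading should be forced for a metadata partition.
def forceFullScan(metadataPartition: String, fullScanEnabledGlobally: Boolean): Boolean =
  fullScanEnabledGlobally || metadataPartition == "files" // files partition is small; caching wins
{code}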



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3300) Timeline server FSViewManager should avoid point lookup for metadata file partition

2022-08-22 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3300:
-
Sprint:   (was: 2022/09/19)

> Timeline server FSViewManager should avoid point lookup for metadata file 
> partition
> ---
>
> Key: HUDI-3300
> URL: https://issues.apache.org/jira/browse/HUDI-3300
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata, timeline-server
>Reporter: Manoj Govindassamy
>Assignee: Yue Zhang
>Priority: Major
> Fix For: 0.12.1
>
>
> When inline reading is enabled, that is 
> hoodie.metadata.enable.full.scan.log.files = false, 
> MetadataMergedLogRecordReader doesn't cache the file listing records via the 
> ExternalSpillableMap. So, every file listing will lead to re-reading of the 
> metadata files partition's log and base files. Since the files partition is small, 
> even when inline reading is enabled, the TimelineServer should 
> construct the FSViewManager with inline reading disabled for the metadata files 
> partition. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3300) Timeline server FSViewManager should avoid point lookup for metadata file partition

2022-08-22 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3300:
-
Priority: Major  (was: Blocker)

> Timeline server FSViewManager should avoid point lookup for metadata file 
> partition
> ---
>
> Key: HUDI-3300
> URL: https://issues.apache.org/jira/browse/HUDI-3300
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata, timeline-server
>Reporter: Manoj Govindassamy
>Assignee: Yue Zhang
>Priority: Major
> Fix For: 0.12.1
>
>
> When inline reading is enabled, that is 
> hoodie.metadata.enable.full.scan.log.files = false, 
> MetadataMergedLogRecordReader doesn't cache the file listing records via the 
> ExternalSpillableMap. So, every file listing will lead to re-reading of the 
> metadata files partition's log and base files. Since the files partition is small, 
> even when inline reading is enabled, the TimelineServer should 
> construct the FSViewManager with inline reading disabled for the metadata files 
> partition. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3300) Timeline server FSViewManager should avoid point lookup for metadata file partition

2022-08-22 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3300:
-
Story Points: 0  (was: 2)

> Timeline server FSViewManager should avoid point lookup for metadata file 
> partition
> ---
>
> Key: HUDI-3300
> URL: https://issues.apache.org/jira/browse/HUDI-3300
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata, timeline-server
>Reporter: Manoj Govindassamy
>Assignee: Yue Zhang
>Priority: Major
> Fix For: 0.12.1
>
>
> When inline reading is enabled, that is 
> hoodie.metadata.enable.full.scan.log.files = false, 
> MetadataMergedLogRecordReader doesn't cache the file listing records via the 
> ExternalSpillableMap. So, every file listing will lead to re-reading of the 
> metadata files partition's log and base files. Since the files partition is small, 
> even when inline reading is enabled, the TimelineServer should 
> construct the FSViewManager with inline reading disabled for the metadata files 
> partition. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3453) Metadata table throws NPE when scheduling compaction plan

2022-08-22 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3453:
-
Sprint:   (was: 2022/08/22)

> Metadata table throws NPE when scheduling compaction plan
> -
>
> Key: HUDI-3453
> URL: https://issues.apache.org/jira/browse/HUDI-3453
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: core
>Reporter: Danny Chen
>Assignee: Yue Zhang
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> {code:java}
> Caused by: org.apache.hudi.exception.HoodieException: Error occurs when 
> executing flatMap
>   at 
> org.apache.hudi.common.function.FunctionWrapper.lambda$throwingFlatMapWrapper$1(FunctionWrapper.java:50)
>   at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:269)
>   at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
>   at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
>   at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
>   at java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:747)
>   at java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:721)
>   at java.util.stream.AbstractTask.compute(AbstractTask.java:327)
>   at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
>   at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
>   at java.util.concurrent.ForkJoinPool.helpComplete(ForkJoinPool.java:1870)
>   at 
> java.util.concurrent.ForkJoinPool.externalHelpComplete(ForkJoinPool.java:2467)
>   at 
> java.util.concurrent.ForkJoinTask.externalAwaitDone(ForkJoinTask.java:324)
>   at java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:405)
>   at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734)
>   at java.util.stream.ReduceOps$ReduceOp.evaluateParallel(ReduceOps.java:714)
>   at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
>   at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
>   at 
> org.apache.hudi.client.common.HoodieFlinkEngineContext.flatMap(HoodieFlinkEngineContext.java:136)
>   at 
> org.apache.hudi.table.action.compact.HoodieCompactor.generateCompactionPlan(HoodieCompactor.java:263)
>   at 
> org.apache.hudi.table.action.compact.ScheduleCompactionActionExecutor.scheduleCompaction(ScheduleCompactionActionExecutor.java:122)
>   at 
> org.apache.hudi.table.action.compact.ScheduleCompactionActionExecutor.execute(ScheduleCompactionActionExecutor.java:92)
>   at 
> org.apache.hudi.table.HoodieFlinkMergeOnReadTable.scheduleCompaction(HoodieFlinkMergeOnReadTable.java:109)
>   at 
> org.apache.hudi.client.AbstractHoodieWriteClient.scheduleTableServiceInternal(AbstractHoodieWriteClient.java:1100)
>   at 
> org.apache.hudi.client.AbstractHoodieWriteClient.scheduleTableService(AbstractHoodieWriteClient.java:1083)
>   at 
> org.apache.hudi.client.AbstractHoodieWriteClient.scheduleCompactionAtInstant(AbstractHoodieWriteClient.java:850)
>   at 
> org.apache.hudi.client.AbstractHoodieWriteClient.scheduleCompaction(AbstractHoodieWriteClient.java:841)
>   at 
> org.apache.hudi.util.CompactionUtil.scheduleCompaction(CompactionUtil.java:64)
>   at 
> org.apache.hudi.sink.StreamWriteOperatorCoordinator.lambda$notifyCheckpointComplete$2(StreamWriteOperatorCoordinator.java:229)
>   at 
> org.apache.hudi.sink.utils.NonThrownExecutor.lambda$execute$0(NonThrownExecutor.java:93)
>   ... 3 more
> Caused by: org.apache.hudi.exception.HoodieMetadataException: Failed to 
> retrieve files in partition 
> oss://datalake-huifu/hudi/poc/ods/pnrweb_prod/trans_log/20220216 from metadata
>   at 
> org.apache.hudi.metadata.BaseTableMetadata.getAllFilesInPartition(BaseTableMetadata.java:124)
>   at 
> org.apache.hudi.metadata.HoodieMetadataFileSystemView.listPartition(HoodieMetadataFileSystemView.java:65)
>   at 
> org.apache.hudi.common.table.view.AbstractTableFileSystemView.lambda$ensurePartitionLoadedCorrectly$9(AbstractTableFileSystemView.java:304)
>   at 
> java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1660)
>   at 
> org.apache.hudi.common.table.view.AbstractTableFileSystemView.ensurePartitionLoadedCorrectly(AbstractTableFileSystemView.java:295)
>   at 
> org.apache.hudi.common.table.view.AbstractTableFileSystemView.getLatestFileSlices(AbstractTableFileSystemView.java:578)
>   at 
> org.apache.hudi.common.table.view.PriorityBasedFileSystemView.execute(PriorityBasedFileSystemView.java:80)
>   at 
> org.apache.hudi.common.table.view.PriorityBasedFileSystemView.getLatestFileSlices(PriorityBasedFileSystemView.java:170)
>   at 
> org.apache.hudi.table.action.compact.HoodieCompactor.lambda$generateCompactionPlan$30498406$1(HoodieCompactor.java:264)
>   at 

[jira] [Closed] (HUDI-1461) Bulk insert v2 creates additional small files

2022-08-22 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin closed HUDI-1461.
-
Resolution: Duplicate

> Bulk insert v2 creates additional small files
> -
>
> Key: HUDI-1461
> URL: https://issues.apache.org/jira/browse/HUDI-1461
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: performance
>Reporter: Wenning Ding
>Priority: Major
>
> I took a look at the data preparation step for bulk insert, and found that the 
> current logic will create additional small files when performing bulk insert 
> v2, which will hurt performance.
> The current logic is to first sort the input dataframe and then do coalesce: 
> [https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/HoodieDatasetBulkInsertHelper.java#L104-L106]
> For example, we set BulkInsertShuffleParallelism to 2 and have the following 
> df as input:
> {code:java}
> val df = Seq(
>   (100, "event_name_16", "2015-01-01T13:51:39.340396Z", "type1"),
>   (101, "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"),
>   (104, "event_name_123", "2015-01-01T12:15:00.512679Z", "type1"),
>   (108, "event_name_18", "2015-01-01T11:51:33.340396Z", "type1"),
>   (109, "event_name_19", "2014-01-01T11:51:33.340396Z", "type3"),
>   (110, "event_name_20", "2014-02-01T11:51:33.340396Z", "type3"),
>   (105, "event_name_678", "2015-01-01T13:51:42.248818Z", "type2")
>   ).toDF("event_id", "event_name", "event_ts", "event_type")
> {code}
> (Here I added a new column partitionID for better understanding) Based on the 
> current logic, after sorting and coalesce, the dataframe would become:
> {code:java}
> val df2 = df.sort(functions.col("event_type"), 
> functions.col("event_id")).coalesce(2)
> df2.withColumn("partitionID", spark_partition_id).show(false)
> ++--+---+--+---+
> |event_id|event_name|event_ts   |event_type|partitionID|
> ++--+---+--+---+
> |100 |event_name_16 |2015-01-01T13:51:39.340396Z|type1 |0  |
> |108 |event_name_18 |2015-01-01T11:51:33.340396Z|type1 |0  |
> |105 |event_name_678|2015-01-01T13:51:42.248818Z|type2 |0  |
> |110 |event_name_20 |2014-02-01T11:51:33.340396Z|type3 |0  |
> |104 |event_name_123|2015-01-01T12:15:00.512679Z|type1 |1  |
> |101 |event_name_546|2015-01-01T12:14:58.597216Z|type2 |1  |
> |109 |event_name_19 |2014-01-01T11:51:33.340396Z|type3 |1  |
> ++--+---+--+---+
> {code}
> You can see the coalesce result actually does not depend on the sorting 
> result: each Spark partition id contains all 3 types of Hudi partitions.
> So during the writing phase, each Spark executor would get its corresponding 
> partition id, and each executor would create 3 files under 3 Hudi partitions. 
> Finally we have two parquet files under each Hudi partition. But with such a 
> small dataset, ideally we should have a single file under each Hudi partition.
> If I change the sort to repartition:
> {code:java}
> val df3 = df.repartition(functions.col("event_type")).coalesce(2)
> df3.withColumn("partitionID", spark_partition_id).show(false)
> ++--+---+--+---+
> |event_id|event_name|event_ts   |event_type|partitionID|
> ++--+---+--+---+
> |100 |event_name_16 |2015-01-01T13:51:39.340396Z|type1 |0  |
> |104 |event_name_123|2015-01-01T12:15:00.512679Z|type1 |0  |
> |108 |event_name_18 |2015-01-01T11:51:33.340396Z|type1 |0  |
> |101 |event_name_546|2015-01-01T12:14:58.597216Z|type2 |1  |
> |105 |event_name_678|2015-01-01T13:51:42.248818Z|type2 |1  |
> |109 |event_name_19 |2014-01-01T11:51:33.340396Z|type3 |1  |
> |110 |event_name_20 |2014-02-01T11:51:33.340396Z|type3 |1  |
> ++--+---+--+---+
> {code}
> In this case, we can have single file under each Hudi partition.
>  
> But according to our understanding, we still need the sort part so that we 
> can benefit from the min/max record key index. So the problem is how we should 
> correctly handle the logic.
> Repartition and sort within each partition might be a way, though sorting within 
> each partition might cause OOM issues if the data is unbalanced.
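
A sketch of that idea (untested, reusing the example's dataframe and imports): repartition on the Hudi partition column first, then sort within each Spark partition to keep the min/max record-key index benefit without a global sort.

{code:scala}
val df4 = df
  .repartition(functions.col("event_type"))
  .sortWithinPartitions(functions.col("event_type"), functions.col("event_id"))
df4.withColumn("partitionID", spark_partition_id).show(false)
// Each Hudi partition now maps to a single Spark partition, and records within
// it are sorted by key, so a single file per Hudi partition can be written.
{code}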



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3636) Clustering fails due to marker creation failure

2022-08-22 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3636:
-
Reviewers: sivabalan narayanan

> Clustering fails due to marker creation failure
> ---
>
> Key: HUDI-3636
> URL: https://issues.apache.org/jira/browse/HUDI-3636
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: multi-writer
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> Scenario: multi-writer test, one writer ingesting with Deltastreamer in 
> continuous mode, COW, inserts, async clustering and cleaning (partitions 
> under 2022/1, 2022/2), another writer with Spark datasource doing backfills 
> to different partitions (2021/12).  
> Upgrade path: 0.10.0 without MT, with a clustering instant inflight (failed in the 
> middle before the upgrade) ➝ 0.11 with MT, multi-writer configuration the same as before.
> The clustering/replace instant cannot make progress due to marker creation 
> failure, failing the DS ingestion as well.  Need to investigate if this is 
> timeline-server-based marker related or MT related.
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in 
> stage 46.0 failed 1 times, most recent failure: Lost task 2.0 in stage 46.0 
> (TID 277) (192.168.70.231 executor driver): java.lang.RuntimeException: 
> org.apache.hudi.exception.HoodieException: 
> org.apache.hudi.exception.HoodieException: 
> java.util.concurrent.ExecutionException: 
> org.apache.hudi.exception.HoodieRemoteException: Failed to create marker file 
> 2022/1/24/aa2f24d3-882f-4d48-b20e-9fcd3540c7a7-0_2-46-277_20220314101326706.parquet.marker.CREATE
> Connect to localhost:26754 [localhost/127.0.0.1, localhost/0:0:0:0:0:0:0:1] 
> failed: Connection refused (Connection refused)
>     at 
> org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:121)
>     at 
> scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:46)
>     at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
>     at scala.collection.Iterator.foreach(Iterator.scala:943)
>     at scala.collection.Iterator.foreach$(Iterator.scala:943)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>     at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
>     at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
>     at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
>     at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
>     at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
>     at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)
>     at scala.collection.AbstractIterator.to(Iterator.scala:1431)
>     at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358)
>     at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358)
>     at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1431)
>     at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345)
>     at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339)
>     at scala.collection.AbstractIterator.toArray(Iterator.scala:1431)
>     at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1030)
>     at 
> org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2254)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>     at org.apache.spark.scheduler.Task.run(Task.scala:131)
>     at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.hudi.exception.HoodieException: 
> org.apache.hudi.exception.HoodieException: 
> java.util.concurrent.ExecutionException: 
> org.apache.hudi.exception.HoodieRemoteException: Failed to create marker file 
> 2022/1/24/aa2f24d3-882f-4d48-b20e-9fcd3540c7a7-0_2-46-277_20220314101326706.parquet.marker.CREATE
> Connect to localhost:26754 [localhost/127.0.0.1, localhost/0:0:0:0:0:0:0:1] 
> failed: Connection refused (Connection refused)
>     at 
> org.apache.hudi.execution.SparkLazyInsertIterable.computeNext(SparkLazyInsertIterable.java:94)
>     at 
> org.apache.hudi.executio

[jira] [Updated] (HUDI-4637) Release thread in RateLimiter is not terminated

2022-08-22 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4637:
-
Reviewers: sivabalan narayanan

> Release thread in RateLimiter is not terminated
> ---
>
> Key: HUDI-4637
> URL: https://issues.apache.org/jira/browse/HUDI-4637
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: index
>Reporter: xi chaomin
>Assignee: xi chaomin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> When I use the HBase index, I find the job never finishes. I set the log level to 
> DEBUG and saw this printed endlessly: 
> {code:java}
> 22/08/17 18:26:45 DEBUG RateLimiter: Release permits: maxPremits: 100, 
> available: 100
> 22/08/17 18:26:45 DEBUG RateLimiter: Release permits: maxPremits: 1000, 
> available: 1000 {code}
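
For illustration, a minimal sketch (assumed names, not Hudi's actual RateLimiter) of why the job can hang and one way to fix it: a non-daemon permit-release thread that is never shut down keeps the JVM alive; marking it daemon and shutting the scheduler down on close lets the job terminate.

{code:scala}
import java.util.concurrent.{Executors, ThreadFactory, TimeUnit}

class RateLimiter(maxPermits: Int) extends AutoCloseable {
  private val scheduler = Executors.newSingleThreadScheduledExecutor(new ThreadFactory {
    override def newThread(r: Runnable): Thread = {
      val t = new Thread(r, "rate-limiter-release")
      t.setDaemon(true) // a daemon thread cannot keep the JVM alive on its own
      t
    }
  })

  // Replenish permits once a second (release logic elided).
  scheduler.scheduleAtFixedRate(new Runnable { override def run(): Unit = release() },
    1, 1, TimeUnit.SECONDS)

  private def release(): Unit = { /* top permits back up to maxPermits */ }

  override def close(): Unit = scheduler.shutdownNow() // stop the thread explicitly
}
{code}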



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4635) Update roadmap page based on H2 2022 plan

2022-08-22 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4635:
-
Reviewers: Raymond Xu

> Update roadmap page based on H2 2022 plan
> -
>
> Key: HUDI-4635
> URL: https://issues.apache.org/jira/browse/HUDI-4635
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: docs
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4326) Hudi spark datasource error after migrate from 0.8 to 0.11

2022-08-22 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4326:
-
Reviewers: Raymond Xu

> Hudi spark datasource error after migrate from 0.8 to 0.11
> --
>
> Key: HUDI-4326
> URL: https://issues.apache.org/jira/browse/HUDI-4326
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Reporter: Kyle Zhike Chen
>Assignee: Kyle Zhike Chen
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> After updating Hudi from 0.8 to 0.11, using {{spark.table(fullTableName)}} to 
> read a Hudi table is not working. The table has been synced to the Hive metastore 
> and Spark is connected to the metastore. The error is:
> org.sparkproject.guava.util.concurrent.UncheckedExecutionException: 
> org.apache.hudi.exception.HoodieException: 'path' or 'Key: 
> 'hoodie.datasource.read.paths' , default: null description: Comma separated 
> list of file paths to read within a Hudi table. since version: version is not 
> defined deprecated after: version is not defined)' or both must be specified.
> at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2263)
> at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4000)
> at 
> org.sparkproject.guava.cache.LocalCache$LocalManualCache.get(LocalCache.java:4789)
> at org.apache.spark.sql.catalyst.catalog.SessionCatalog.
> ...
> Caused by: org.apache.hudi.exception.HoodieException: 'path' or 'Key: 
> 'hoodie.datasource.read.paths' , default: null description: Comma separated 
> list of file paths to read within a Hudi table. since version: version is not 
> defined deprecated after: version is not defined)' or both must be specified.
> at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:78)
> at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:353)
> at 
> org.apache.spark.sql.execution.datasources.FindDataSourceTable.$anonfun$readDataSourceTable$1(DataSourceStrategy.scala:261)
> at 
> org.sparkproject.guava.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4792)
> at 
> org.sparkproject.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
> at 
> org.sparkproject.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
> at 
> org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
> at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) 
> After changing the table to a Spark data source table, the table's SerDeInfo 
> is missing. I created a pull request.
>  
> related GH issue:
> https://github.com/apache/hudi/issues/5861



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-4327) TestHoodieDeltaStreamer#testCleanerDeleteReplacedDataWithArchive is flaky

2022-08-22 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu reassigned HUDI-4327:


Assignee: sivabalan narayanan

> TestHoodieDeltaStreamer#testCleanerDeleteReplacedDataWithArchive is flaky
> -
>
> Key: HUDI-4327
> URL: https://issues.apache.org/jira/browse/HUDI-4327
> Project: Apache Hudi
>  Issue Type: Task
>  Components: tests-ci, timeline-server
>Reporter: Sagar Sumit
>Assignee: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-4695) Flaky: TestInlineCompaction.testCompactionRetryOnFailureBasedOnTime:308 expected: <4> but was: <5>

2022-08-22 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu reassigned HUDI-4695:


Assignee: sivabalan narayanan

> Flaky: TestInlineCompaction.testCompactionRetryOnFailureBasedOnTime:308 
> expected: <4> but was: <5>
> --
>
> Key: HUDI-4695
> URL: https://issues.apache.org/jira/browse/HUDI-4695
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Raymond Xu
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.12.1
>
>
> https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=10841&view=logs&j=600e7de6-e133-5e69-e615-50ee129b3c08&t=bbbd7bcc-ae73-56b8-887a-cd2d6deaafc7



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-4696) Flaky: TestHoodieCombineHiveInputFormat.setUpClass:86 » NullPointer

2022-08-22 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu reassigned HUDI-4696:


Assignee: Raymond Xu

> Flaky: TestHoodieCombineHiveInputFormat.setUpClass:86 » NullPointer

> 
>
> Key: HUDI-4696
> URL: https://issues.apache.org/jira/browse/HUDI-4696
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Major
> Fix For: 0.12.1
>
>
> https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=10720&view=logs&j=dcedfe73-9485-5cc5-817a-73b61fc5dcb0&t=746585d8-b50a-55c3-26c5-517d93af9934



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4695) Flaky: TestInlineCompaction.testCompactionRetryOnFailureBasedOnTime:308 expected: <4> but was: <5>

2022-08-22 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4695:
-
Description: 
https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=10841&view=logs&j=600e7de6-e133-5e69-e615-50ee129b3c08&t=bbbd7bcc-ae73-56b8-887a-cd2d6deaafc7

> Flaky: TestInlineCompaction.testCompactionRetryOnFailureBasedOnTime:308 
> expected: <4> but was: <5>
> --
>
> Key: HUDI-4695
> URL: https://issues.apache.org/jira/browse/HUDI-4695
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Raymond Xu
>Priority: Major
> Fix For: 0.12.1
>
>
> https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=10841&view=logs&j=600e7de6-e133-5e69-e615-50ee129b3c08&t=bbbd7bcc-ae73-56b8-887a-cd2d6deaafc7



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4438) Fix flaky TestCopyOnWriteActionExecutor.testPartitionMetafileFormat test

2022-08-22 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4438:
-
Sprint:   (was: 2022/09/05)

> Fix flaky TestCopyOnWriteActionExecutor.testPartitionMetafileFormat test
> 
>
> Key: HUDI-4438
> URL: https://issues.apache.org/jira/browse/HUDI-4438
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.12.1
>
>
> It's very flaky and fails periodically for the re-runs of the same build:
> Succeeding build:
> [https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=10106&view=results]
> Failing build (right after it):
> https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=10150&view=logs&j=600e7de6-e133-5e69-e615-
>  50ee129b3c08&t=bbbd7bcc-ae73-56b8-887a-cd2d6deaafc7



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] gudladona commented on issue #6474: [SUPPORT] Hudi Deltastreamer fails to acquire lock with DynamoDB Lock Provider.

2022-08-22 Thread GitBox


gudladona commented on issue #6474:
URL: https://github.com/apache/hudi/issues/6474#issuecomment-1223386268

   This seems to be a more comprehensive sequence than the one above:
   
   ```
   cat hudi-logs.txt | grep -E 'TransactionManager|DynamoDBBasedLockProvider'   
   
   hoodie.write.lock.provider: 
org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider
   22/08/22 02:04:05 INFO 
org.apache.hudi.client.transaction.TransactionManager: Transaction starting for 
Option{val=[==>20220822020402958__deltacommit__INFLIGHT]} with latest completed 
transaction instant Optional.empty
   22/08/22 02:04:05 INFO org.apache.hudi.client.transaction.lock.LockManager: 
LockProvider org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider
   22/08/22 02:04:07 INFO 
org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider: ACQUIRING lock 
at DynamoDb table = HudiLocker, partition key = process
   22/08/22 02:04:07 INFO 
org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider: ACQUIRED lock 
at DynamoDb table = HudiLocker, partition key = process
   22/08/22 02:04:07 INFO 
org.apache.hudi.client.transaction.TransactionManager: Transaction started for 
Option{val=[==>20220822020402958__deltacommit__INFLIGHT]} with latest completed 
transaction instant Optional.empty
   22/08/22 02:04:14 INFO 
org.apache.hudi.client.transaction.TransactionManager: Transaction ending with 
transaction owner Option{val=[==>20220822020402958__deltacommit__INFLIGHT]}
   22/08/22 02:04:14 INFO 
org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider: RELEASING lock 
at DynamoDb table = HudiLocker, partition key = process
   22/08/22 02:04:14 INFO 
org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider: RELEASED lock 
at DynamoDb table = HudiLocker, partition key = process
   22/08/22 02:04:14 INFO 
org.apache.hudi.client.transaction.TransactionManager: Transaction ended with 
transaction owner Option{val=[==>20220822020402958__deltacommit__INFLIGHT]}
   22/08/22 02:04:20 INFO 
org.apache.hudi.client.transaction.TransactionManager: Transaction starting for 
Optional.empty with latest completed transaction instant Optional.empty
   22/08/22 02:04:20 INFO org.apache.hudi.client.transaction.lock.LockManager: 
LockProvider org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider
   22/08/22 02:04:20 INFO 
org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider: ACQUIRING lock 
at DynamoDb table = HudiLocker, partition key = process
   22/08/22 02:04:20 INFO 
org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider: ACQUIRED lock 
at DynamoDb table = HudiLocker, partition key = process
   22/08/22 02:04:20 INFO 
org.apache.hudi.client.transaction.TransactionManager: Transaction started for 
Optional.empty with latest completed transaction instant Optional.empty
   22/08/22 02:05:36 INFO 
org.apache.hudi.client.transaction.TransactionManager: Transaction ending with 
transaction owner Optional.empty
   22/08/22 02:05:36 INFO 
org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider: RELEASING lock 
at DynamoDb table = HudiLocker, partition key = process
   22/08/22 02:05:36 INFO 
org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider: RELEASED lock 
at DynamoDb table = HudiLocker, partition key = process
   22/08/22 02:05:36 INFO 
org.apache.hudi.client.transaction.TransactionManager: Transaction ended with 
transaction owner Optional.empty
   22/08/22 02:06:31 INFO 
org.apache.hudi.client.transaction.TransactionManager: Transaction starting for 
Option{val=[==>20220822020402958__deltacommit__INFLIGHT]} with latest completed 
transaction instant Option{val=[20220822015627468__deltacommit__COMPLETED]}
   22/08/22 02:06:31 INFO 
org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider: ACQUIRING lock 
at DynamoDb table = HudiLocker, partition key = process
   22/08/22 02:06:31 INFO 
org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider: ACQUIRED lock 
at DynamoDb table = HudiLocker, partition key = process
   22/08/22 02:06:31 INFO 
org.apache.hudi.client.transaction.TransactionManager: Transaction started for 
Option{val=[==>20220822020402958__deltacommit__INFLIGHT]} with latest completed 
transaction instant Option{val=[20220822015627468__deltacommit__COMPLETED]}
   22/08/22 02:06:48 INFO 
org.apache.hudi.client.transaction.TransactionManager: Transaction starting for 
Optional.empty with latest completed transaction instant Optional.empty
   22/08/22 02:06:48 INFO org.apache.hudi.client.transaction.lock.LockManager: 
LockProvider org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider
   22/08/22 02:06:48 INFO 
org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider: ACQUIRING lock 
at DynamoDb table = HudiLocker, partition key = process
   22/08/22 02:08:59 INFO 
org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider: ACQUIRING lock 
at DynamoDb table = HudiLock

[GitHub] [hudi] hudi-bot commented on pull request #6450: [HUDI-4665] Flipping default for "ignore failed batch" config in streaming sink to false

2022-08-22 Thread GitBox


hudi-bot commented on PR #6450:
URL: https://github.com/apache/hudi/pull/6450#issuecomment-1223386217

   
   ## CI report:
   
   * 3bd700dea82006f1d3081c3eee7ab1b430728911 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10888)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-3054) Fix flaky TestHoodieClientMultiWriter. testHoodieClientBasicMultiWriter

2022-08-22 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3054:
-
Sprint:   (was: 2022/08/22)

> Fix flaky TestHoodieClientMultiWriter. testHoodieClientBasicMultiWriter
> ---
>
> Key: HUDI-3054
> URL: https://issues.apache.org/jira/browse/HUDI-3054
> Project: Apache Hudi
>  Issue Type: Test
>  Components: Testing, tests-ci
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> Ref: 
> [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/4428/logs/21]
>  
> {code:java}
> 2021-12-17T11:39:57.1645757Z [INFO] Running 
> org.apache.hudi.client.TestHoodieClientMultiWriter
> 2021-12-17T11:39:57.3453991Z 339506 [main] WARN  
> org.apache.hudi.metadata.HoodieBackedTableMetadata  - Metadata table was not 
> found at path /tmp/junit8865530218583640556/dataset/.hoodie/metadata
> 2021-12-17T11:39:57.3984328Z 339559 [dispatcher-event-loop-5] WARN  
> org.apache.spark.scheduler.TaskSetManager  - Stage 0 contains a task of very 
> large size (101 KB). The maximum recommended task size is 100 KB.
> 2021-12-17T11:39:57.5278608Z 339689 [dispatcher-event-loop-2] WARN  
> org.apache.spark.scheduler.TaskSetManager  - Stage 1 contains a task of very 
> large size (101 KB). The maximum recommended task size is 100 KB.
> 2021-12-17T11:39:57.9783107Z 340139 [main] WARN  
> org.apache.hudi.metadata.HoodieBackedTableMetadata  - Metadata table was not 
> found at path /tmp/junit8865530218583640556/dataset/.hoodie/metadata
> 2021-12-17T11:39:57.9927490Z 340154 [main] WARN  
> org.apache.hudi.metadata.HoodieBackedTableMetadata  - Metadata table was not 
> found at path /tmp/junit8865530218583640556/dataset/.hoodie/metadata
> 2021-12-17T11:40:10.1428665Z 352304 [main] WARN  
> org.apache.hudi.testutils.HoodieClientTestHarness  - Closing file-system 
> instance used in previous test-run
> 2021-12-17T11:40:10.9930023Z 353149 [main] WARN  
> org.apache.hudi.testutils.HoodieClientTestHarness  - Closing file-system 
> instance used in previous test-run
> 2021-12-17T11:40:11.4294603Z 353590 [main] WARN  
> org.apache.hudi.metadata.HoodieBackedTableMetadata  - Metadata table was not 
> found at path /tmp/junit3262960667280061850/dataset/.hoodie/metadata
> 2021-12-17T11:40:11.4763085Z 353637 [dispatcher-event-loop-5] WARN  
> org.apache.spark.scheduler.TaskSetManager  - Stage 0 contains a task of very 
> large size (101 KB). The maximum recommended task size is 100 KB.
> 2021-12-17T11:40:11.6014876Z 353762 [dispatcher-event-loop-2] WARN  
> org.apache.spark.scheduler.TaskSetManager  - Stage 1 contains a task of very 
> large size (101 KB). The maximum recommended task size is 100 KB.
> 2021-12-17T11:40:12.0892513Z 354250 [main] WARN  
> org.apache.hudi.metadata.HoodieBackedTableMetadata  - Metadata table was not 
> found at path /tmp/junit3262960667280061850/dataset/.hoodie/metadata
> 2021-12-17T11:40:12.1061317Z 354267 [main] WARN  
> org.apache.hudi.metadata.HoodieBackedTableMetadata  - Metadata table was not 
> found at path /tmp/junit3262960667280061850/dataset/.hoodie/metadata
> 2021-12-17T11:40:23.1499732Z 365311 [main] WARN  
> org.apache.hudi.testutils.HoodieClientTestHarness  - Closing file-system 
> instance used in previous test-run
> 2021-12-17T11:40:24.1626167Z 366323 [main] WARN  
> org.apache.hudi.metadata.HoodieBackedTableMetadata  - Metadata table was not 
> found at path /tmp/junit294667857867877904/dataset/.hoodie/metadata
> 2021-12-17T11:40:24.1945944Z 366355 [dispatcher-event-loop-5] WARN  
> org.apache.spark.scheduler.TaskSetManager  - Stage 0 contains a task of very 
> large size (101 KB). The maximum recommended task size is 100 KB.
> 2021-12-17T11:40:24.3084730Z 366469 [dispatcher-event-loop-2] WARN  
> org.apache.spark.scheduler.TaskSetManager  - Stage 1 contains a task of very 
> large size (101 KB). The maximum recommended task size is 100 KB.
> 2021-12-17T11:40:24.7350862Z 366896 [main] WARN  
> org.apache.hudi.metadata.HoodieBackedTableMetadata  - Metadata table was not 
> found at path /tmp/junit294667857867877904/dataset/.hoodie/metadata
> 2021-12-17T11:40:24.7482727Z 366909 [main] WARN  
> org.apache.hudi.metadata.HoodieBackedTableMetadata  - Metadata table was not 
> found at path /tmp/junit294667857867877904/dataset/.hoodie/metadata
> 2021-12-17T11:40:43.1530857Z 385314 [main] WARN  
> org.apache.hudi.testutils.HoodieClientTestHarness  - Closing file-system 
> instance used in previous test-run
> 2021-12-17T11:40:44.0641298Z 386225 [main] WARN  
> org.apache.hudi.testutils.HoodieClientTestHarness  - Closing file-system 
> instance used in previous test-run
> 2021-12-17T11:40:4

[jira] [Updated] (HUDI-2528) Flaky test: MERGE_ON_READ testTableOperationsWithRestore

2022-08-22 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2528:
-
Sprint:   (was: 2022/08/22)

> Flaky test: MERGE_ON_READ testTableOperationsWithRestore
> 
>
> Key: HUDI-2528
> URL: https://issues.apache.org/jira/browse/HUDI-2528
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Testing, tests-ci
>Reporter: Raymond Xu
>Priority: Blocker
> Fix For: 0.12.1
>
>
>  
> {code:java}
>  [ERROR] Failures:[ERROR] There files should have been rolled-back when 
> rolling back commit 002 but are still remaining. Files: 
> [file:/tmp/junit6464799159313857398/2016/03/15/9d59f0f1-9cfa-41a4-b247-6bf002ad6cc7-0_0-592-8761_001.parquet,
>  
> file:/tmp/junit6464799159313857398/2016/03/15/9d59f0f1-9cfa-41a4-b247-6bf002ad6cc7-0_0-585-8754_001.parquet]
>  ==> expected: <0> but was: <2>[ERROR] Errors:[ERROR] No Compaction 
> request available at 007 to run compaction {code}
>  
> Probably the same cause as HUDI-2108
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4695) Flaky: TestInlineCompaction.testCompactionRetryOnFailureBasedOnTime:308 expected: <4> but was: <5>

2022-08-22 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4695:
-
Story Points: 3

> Flaky: TestInlineCompaction.testCompactionRetryOnFailureBasedOnTime:308 
> expected: <4> but was: <5>
> --
>
> Key: HUDI-4695
> URL: https://issues.apache.org/jira/browse/HUDI-4695
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Raymond Xu
>Priority: Major
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-4696) Flaky: TestHoodieCombineHiveInputFormat.setUpClass:86 » NullPointer

2022-08-22 Thread Raymond Xu (Jira)
Raymond Xu created HUDI-4696:


 Summary: Flaky: TestHoodieCombineHiveInputFormat.setUpClass:86 » 
NullPointer

 Key: HUDI-4696
 URL: https://issues.apache.org/jira/browse/HUDI-4696
 Project: Apache Hudi
  Issue Type: Task
Reporter: Raymond Xu






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4696) Flaky: TestHoodieCombineHiveInputFormat.setUpClass:86 » NullPointer

2022-08-22 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4696:
-
Description: 
https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=10720&view=logs&j=dcedfe73-9485-5cc5-817a-73b61fc5dcb0&t=746585d8-b50a-55c3-26c5-517d93af9934

> Flaky: TestHoodieCombineHiveInputFormat.setUpClass:86 » NullPointer

> 
>
> Key: HUDI-4696
> URL: https://issues.apache.org/jira/browse/HUDI-4696
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Raymond Xu
>Priority: Major
>
> https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=10720&view=logs&j=dcedfe73-9485-5cc5-817a-73b61fc5dcb0&t=746585d8-b50a-55c3-26c5-517d93af9934



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4696) Flaky: TestHoodieCombineHiveInputFormat.setUpClass:86 » NullPointer

2022-08-22 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4696:
-
Story Points: 3

> Flaky: TestHoodieCombineHiveInputFormat.setUpClass:86 » NullPointer

> 
>
> Key: HUDI-4696
> URL: https://issues.apache.org/jira/browse/HUDI-4696
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Raymond Xu
>Priority: Major
>
> https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=10720&view=logs&j=dcedfe73-9485-5cc5-817a-73b61fc5dcb0&t=746585d8-b50a-55c3-26c5-517d93af9934



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #6170: [HUDI-4441] Log4j2 configuration fixes and removal of log4j1 dependencies

2022-08-22 Thread GitBox


hudi-bot commented on PR #6170:
URL: https://github.com/apache/hudi/pull/6170#issuecomment-1223382898

   
   ## CI report:
   
   * 520b1b54c37ce6378047a55f650e309d6feb89d1 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10885)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-4695) Flaky: TestInlineCompaction.testCompactionRetryOnFailureBasedOnTime:308 expected: <4> but was: <5>

2022-08-22 Thread Raymond Xu (Jira)
Raymond Xu created HUDI-4695:


 Summary: Flaky: 
TestInlineCompaction.testCompactionRetryOnFailureBasedOnTime:308 expected: <4> 
but was: <5>
 Key: HUDI-4695
 URL: https://issues.apache.org/jira/browse/HUDI-4695
 Project: Apache Hudi
  Issue Type: Task
Reporter: Raymond Xu






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] gudladona commented on issue #6474: [SUPPORT] Hudi Deltastreamer fails to acquire lock with DynamoDB Lock Provider.

2022-08-22 Thread GitBox


gudladona commented on issue #6474:
URL: https://github.com/apache/hudi/issues/6474#issuecomment-1223380423

   Yes, it seems the transaction that started at 02:06:31 
(20220822020402958__deltacommit__INFLIGHT) is on the metadata table. Also, it 
appears this lock was held for roughly 25 minutes, until 02:31:00.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6170: [HUDI-4441] Log4j2 configuration fixes and removal of log4j1 dependencies

2022-08-22 Thread GitBox


hudi-bot commented on PR #6170:
URL: https://github.com/apache/hudi/pull/6170#issuecomment-1223379502

   
   ## CI report:
   
   * 520b1b54c37ce6378047a55f650e309d6feb89d1 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10885)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua commented on a diff in pull request #6467: [HUDI-4686] Flip option 'write.ignore.failed' to default false

2022-08-22 Thread GitBox


yihua commented on code in PR #6467:
URL: https://github.com/apache/hudi/pull/6467#discussion_r952040628


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java:
##
@@ -327,9 +327,9 @@ private FlinkOptions() {
   public static final ConfigOption IGNORE_FAILED = ConfigOptions
   .key("write.ignore.failed")
   .booleanType()
-  .defaultValue(true)
+  .defaultValue(false)
   .withDescription("Flag to indicate whether to ignore any non exception 
error (e.g. writestatus error). within a checkpoint batch.\n"
-  + "By default true (in favor of streaming progressing over data 
integrity)");
+  + "By default false (in favor of streaming progressing over data 
integrity)");

Review Comment:
   Docs need to be updated.  The statement `in favor of streaming progressing 
over data integrity` is no longer valid for `false`.



##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java:
##
@@ -96,7 +96,7 @@ private FlinkOptions() {
 
   public static final String NO_PRE_COMBINE = "no_precombine";
   public static final ConfigOption PRECOMBINE_FIELD = ConfigOptions
-  .key("payload.ordering.field")
+  .key("precombine.field")

Review Comment:
   Is this change needed?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #6474: [SUPPORT] Hudi Deltastreamer fails to acquire lock with DynamoDB Lock Provider.

2022-08-22 Thread GitBox


nsivabalan commented on issue #6474:
URL: https://github.com/apache/hudi/issues/6474#issuecomment-1223377849

   Note to self:
   
   An excerpt from the logs that is of interest to us:
   ```
   22/08/22 02:06:31 INFO 
org.apache.hudi.client.transaction.TransactionManager: Transaction starting for 
Option{val=[==>20220822020402958__deltacommit__INFLIGHT]} with latest completed 
transaction instant Option{val=[20220822015627468__deltacommit__COMPLETED]}
   22/08/22 02:06:31 INFO 
org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider: ACQUIRING lock 
at DynamoDb table = HudiLocker, partition key = process
   22/08/22 02:06:31 INFO 
org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider: ACQUIRED lock 
at DynamoDb table = HudiLocker, partition key = process
   22/08/22 02:06:31 INFO 
org.apache.hudi.client.transaction.TransactionManager: Transaction started for 
Option{val=[==>20220822020402958__deltacommit__INFLIGHT]} with latest completed 
transaction instant Option{val=[20220822015627468__deltacommit__COMPLETED]}
   22/08/22 02:06:48 INFO 
org.apache.hudi.client.transaction.TransactionManager: Transaction starting for 
Optional.empty with latest completed transaction instant Optional.empty
   22/08/22 02:06:48 INFO org.apache.hudi.client.transaction.lock.LockManager: 
LockProvider org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider
   22/08/22 02:06:48 INFO 
org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider: ACQUIRING lock 
at DynamoDb table = HudiLocker, partition key = process
   22/08/22 02:08:59 INFO 
org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider: ACQUIRING lock 
at DynamoDb table = HudiLocker, partition key = process
   22/08/22 02:11:10 INFO 
org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider: ACQUIRING lock 
at DynamoDb table = HudiLocker, partition key = process
   22/08/22 02:13:21 INFO 
org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider: ACQUIRING lock 
at DynamoDb table = HudiLocker, partition key = process
   22/08/22 02:15:32 INFO 
org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider: ACQUIRING lock 
at DynamoDb table = HudiLocker, partition key = process
   22/08/22 02:17:43 INFO 
org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider: ACQUIRING lock 
at DynamoDb table = HudiLocker, partition key = process
   22/08/22 02:19:53 INFO 
org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider: ACQUIRING lock 
at DynamoDb table = HudiLocker, partition key = process
   22/08/22 02:22:04 INFO 
org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider: ACQUIRING lock 
at DynamoDb table = HudiLocker, partition key = process
   22/08/22 02:24:15 INFO 
org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider: ACQUIRING lock 
at DynamoDb table = HudiLocker, partition key = process
   22/08/22 02:26:25 INFO 
org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider: ACQUIRING lock 
at DynamoDb table = HudiLocker, partition key = process
   22/08/22 02:28:36 INFO 
org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider: ACQUIRING lock 
at DynamoDb table = HudiLocker, partition key = process
   22/08/22 02:30:47 INFO 
org.apache.hudi.client.transaction.TransactionManager: Transaction ending with 
transaction owner Optional.empty
   22/08/22 02:30:47 INFO 
org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider: RELEASING lock 
at DynamoDb table = HudiLocker, partition key = process
   22/08/22 02:30:47 INFO 
org.apache.hudi.client.transaction.TransactionManager: Transaction ended with 
transaction owner Optional.empty
at 
org.apache.hudi.client.transaction.TransactionManager.beginTransaction(TransactionManager.java:53)
   22/08/22 02:31:00 INFO 
org.apache.hudi.client.transaction.TransactionManager: Transaction ending with 
transaction owner Option{val=[==>20220822020402958__deltacommit__INFLIGHT]}
   22/08/22 02:31:00 INFO 
org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider: RELEASING lock 
at DynamoDb table = HudiLocker, partition key = process
   22/08/22 02:31:00 INFO 
org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider: RELEASED lock 
at DynamoDb table = HudiLocker, partition key = process
   22/08/22 02:31:00 INFO 
org.apache.hudi.client.transaction.TransactionManager: Transaction ended with 
transaction owner Option{val=[==>20220822020402958__deltacommit__INFLIGHT]}
   ```
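   
   The ~130s gap between consecutive ACQUIRING attempts matches the lock
client's retry loop. For reference, a sketch of the lock-related writer
configs in play; the table and partition key come from the log above, while
the region and retry values are assumed examples, not the reporter's actual
settings:
   
   ```java
   import java.util.HashMap;
   import java.util.Map;
   
   public class DynamoDbLockConfigSketch {
     // Writer options that drive the locking behavior in the excerpt above.
     public static Map<String, String> lockOptions() {
       Map<String, String> opts = new HashMap<>();
       opts.put("hoodie.write.concurrency.mode", "optimistic_concurrency_control");
       opts.put("hoodie.write.lock.provider",
           "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider");
       opts.put("hoodie.write.lock.dynamodb.table", "HudiLocker");       // from the log
       opts.put("hoodie.write.lock.dynamodb.partition_key", "process");  // from the log
       opts.put("hoodie.write.lock.dynamodb.region", "us-east-1");       // assumed
       // Assumed retry tuning; these knobs control how long and how often
       // the provider re-enters the ACQUIRING state seen above.
       opts.put("hoodie.write.lock.wait_time_ms", "120000");
       opts.put("hoodie.write.lock.num_retries", "15");
       return opts;
     }
   }
   ```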


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


