[GitHub] [hudi] lokeshj1703 commented on pull request #7521: [HUDI-4827] Rebase Azure Image on Ubuntu 22.04 - scalatest-maven-plugin version update

2022-12-20 Thread GitBox


lokeshj1703 commented on PR #7521:
URL: https://github.com/apache/hudi/pull/7521#issuecomment-1360945659

   Related issue in scalatest-maven-plugin: https://github.com/scalatest/scalatest-maven-plugin/pull/43


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7528: [HUDI-5443] Fixing exception trying to read MOR table after NestedSchemaPruning rule has been applied

2022-12-20 Thread GitBox


hudi-bot commented on PR #7528:
URL: https://github.com/apache/hudi/pull/7528#issuecomment-1360933260

   
   ## CI report:
   
   * f3a439884f90500e29da0075f4d0ad7d73a484b3 UNKNOWN
   * 91a60af68934fce696d23ace1db23d652a5bb109 UNKNOWN
   
   
   Bot commands
   @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #7423: [HUDI-5384] Adding optimization rule to appropriately push down filters into the `HoodieFileIndex`

2022-12-20 Thread GitBox


hudi-bot commented on PR #7423:
URL: https://github.com/apache/hudi/pull/7423#issuecomment-1360932876

   
   ## CI report:
   
   * 09b901a56869b8282c92d6c05ad746f98f2d6a01 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13735)
   * 78a6da0b0d5d65f8e7f4c59b495a2820e1f9877f UNKNOWN
   * 0f49e489ea0fa07d75b572cc6b7bd97945da6373 UNKNOWN
   
   
   Bot commands
   @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #7528: [HUDI-5443] Fixing exception trying to read MOR table after NestedSchemaPruning rule has been applied

2022-12-20 Thread GitBox


hudi-bot commented on PR #7528:
URL: https://github.com/apache/hudi/pull/7528#issuecomment-1360927504

   
   ## CI report:
   
   * f3a439884f90500e29da0075f4d0ad7d73a484b3 UNKNOWN
   
   
   Bot commands
   @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #7527: [HUDI-5411] Avoid virtual key info for COW table in the input format

2022-12-20 Thread GitBox


hudi-bot commented on PR #7527:
URL: https://github.com/apache/hudi/pull/7527#issuecomment-1360927478

   
   ## CI report:
   
   * ed2f76f0edfad0ac2175da67a56825ee31a4dd4c UNKNOWN
   
   
   Bot commands
   @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #7423: [HUDI-5384] Adding optimization rule to appropriately push down filters into the `HoodieFileIndex`

2022-12-20 Thread GitBox


hudi-bot commented on PR #7423:
URL: https://github.com/apache/hudi/pull/7423#issuecomment-1360927203

   
   ## CI report:
   
   * 09b901a56869b8282c92d6c05ad746f98f2d6a01 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13735)
   * 78a6da0b0d5d65f8e7f4c59b495a2820e1f9877f UNKNOWN
   
   
   Bot commands
   @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #7526: Revert "[HUDI-5409] Avoid file index and use fs view cache in COW input format (#7493)"

2022-12-20 Thread GitBox


hudi-bot commented on PR #7526:
URL: https://github.com/apache/hudi/pull/7526#issuecomment-1360923199

   
   ## CI report:
   
   * be61fe2207203761b46a40fa32be8ccd2ad6f12c UNKNOWN
   
   
   Bot commands
   @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] xccui commented on issue #7375: [SUPPORT] Hudi 0.12.1 support for Spark Structured Streaming. read clustering metadata replace avro file error. Unrecognized token 'Obj^A^B^Vavro'

2022-12-20 Thread GitBox


xccui commented on issue #7375:
URL: https://github.com/apache/hudi/issues/7375#issuecomment-1360916357

   Built with the latest version but still encountered the same issue (with Flink r/w).
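For background on the error quoted in the issue title: Avro object container files begin with the 4-byte magic `Obj\x01`, so feeding one to a JSON/text parser fails on the literal characters `Obj` ("Unrecognized token 'Obj'"). A minimal illustration of that magic check in plain Java (the class is made up for illustration, not Hudi code):

```java
class AvroMagicCheck {
    // Avro container files start with the bytes 'O', 'b', 'j', 0x01 per the
    // Avro specification; a JSON parser reading such a file reports
    // "Unrecognized token 'Obj'", as seen in this issue.
    static boolean looksLikeAvro(byte[] header) {
        return header.length >= 4
                && header[0] == 'O' && header[1] == 'b'
                && header[2] == 'j' && header[3] == 1;
    }
}
```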





[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #7528: [HUDI-5443] Fixing exception trying to read MOR table after NestedSchemaPruning rule has been applied

2022-12-20 Thread GitBox


alexeykudinkin commented on code in PR #7528:
URL: https://github.com/apache/hudi/pull/7528#discussion_r1054026446


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/common/HoodieSparkEngineContext.java:
##
@@ -57,15 +57,15 @@ public class HoodieSparkEngineContext extends HoodieEngineContext {
 
   private static final Logger LOG = LogManager.getLogger(HoodieSparkEngineContext.class);
   private final JavaSparkContext javaSparkContext;
-  private SQLContext sqlContext;
+  private final SQLContext sqlContext;
 
   public HoodieSparkEngineContext(JavaSparkContext jsc) {
-    super(new SerializableConfiguration(jsc.hadoopConfiguration()), new SparkTaskContextSupplier());
-    this.javaSparkContext = jsc;
-    this.sqlContext = SQLContext.getOrCreate(jsc.sc());
+    this(jsc, SQLContext.getOrCreate(jsc.sc()));
   }
 
-  public void setSqlContext(SQLContext sqlContext) {
+  public HoodieSparkEngineContext(JavaSparkContext jsc, SQLContext sqlContext) {
Review Comment:
   This change is needed to accommodate the fix for `HoodieClientTestHarness`
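The diff above replaces a mutable setter with constructor chaining. A minimal plain-Java sketch of that pattern, with illustrative names (this is not Hudi's actual API):

```java
class EngineContextSketch {
    private final String sqlContext; // final: no setter can reassign it anymore

    // Convenience ctor delegates to the full ctor, mirroring the diff's
    // `this(jsc, SQLContext.getOrCreate(jsc.sc()))` pattern.
    EngineContextSketch() {
        this("default-sql-context");
    }

    EngineContextSketch(String sqlContext) {
        this.sqlContext = sqlContext;
    }

    String getSqlContext() {
        return sqlContext;
    }
}
```

Chaining keeps a single initialization path, which lets the field become `final` and removes the window where callers could observe a partially configured object.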



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BaseFileOnlyRelation.scala:
##
@@ -36,26 +36,29 @@ import org.apache.spark.sql.types.StructType
  * [[BaseRelation]] implementation only reading Base files of Hudi tables, essentially supporting following querying
  * modes:
  * 
- * For COW tables: Snapshot
- * For MOR tables: Read-optimized
+ *  For COW tables: Snapshot
+ *  For MOR tables: Read-optimized
  * 
  *
- * NOTE: The reason this Relation is used in liue of Spark's default [[HadoopFsRelation]] is primarily due to the
+ * NOTE: The reason this Relation is used in-liue of Spark's default [[HadoopFsRelation]] is primarily due to the
  * fact that it injects real partition's path as the value of the partition field, which Hudi ultimately persists
  * as part of the record payload. In some cases, however, partition path might not necessarily be equal to the
  * verbatim value of the partition path field (when custom [[KeyGenerator]] is used) therefore leading to incorrect
  * partition field values being written
  */
-class BaseFileOnlyRelation(sqlContext: SQLContext,
-                           metaClient: HoodieTableMetaClient,
-                           optParams: Map[String, String],
-                           userSchema: Option[StructType],
-                           globPaths: Seq[Path])
-  extends HoodieBaseRelation(sqlContext, metaClient, optParams, userSchema) with SparkAdapterSupport {
+case class BaseFileOnlyRelation(override val sqlContext: SQLContext,

Review Comment:
   The primary change here is converting the class into a case class, which in turn entails that all of the ctor parameters become fields, requiring the corresponding `override val` annotations



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/MergeOnReadSnapshotRelation.scala:
##
@@ -42,12 +42,35 @@ import scala.collection.JavaConverters._
 case class HoodieMergeOnReadFileSplit(dataFile: Option[PartitionedFile],
                                       logFiles: List[HoodieLogFile]) extends HoodieFileSplit
 
-class MergeOnReadSnapshotRelation(sqlContext: SQLContext,
-                                  optParams: Map[String, String],
-                                  userSchema: Option[StructType],
-                                  globPaths: Seq[Path],
-                                  metaClient: HoodieTableMetaClient)
-  extends HoodieBaseRelation(sqlContext, metaClient, optParams, userSchema) {
+case class MergeOnReadSnapshotRelation(override val sqlContext: SQLContext,

Review Comment:
   Same changes as other relations



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBaseRelation.scala:
##
@@ -383,50 +372,28 @@ abstract class HoodieBaseRelation(val sqlContext: SQLContext,
    */
   protected def collectFileSplits(partitionFilters: Seq[Expression], dataFilters: Seq[Expression]): Seq[FileSplit]
 
-  /**
-   * Get all PartitionDirectories based on globPaths if specified, otherwise use the table path.
-   * Will perform pruning if necessary
-   */
-  private def listPartitionDirectories(globPaths: Seq[Path], partitionFilters: Seq[Expression], dataFilters: Seq[Expression]): Seq[PartitionDirectory] = {

Review Comment:
   Combined these 2 methods into 1



##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scala:
##
@@ -427,6 +428,10 @@ class TestMORDataSource extends HoodieClientTestBase with SparkDatasetMixin {
   @ParameterizedTest
   @EnumSource(value = classOf[HoodieRecordType], names = Array("AVRO", "SPARK"))
   def testPrunedFiltered(recordType: HoodieRecordType) {
+
+    spark.sessionState.conf.setConf(SQLConf.WHOLESTAGE_CODEGEN_ENABLED, false)

Review Comment:
   Will be reverted



##

[jira] [Updated] (HUDI-5443) Fix exception when querying MOR table after applying NestedSchemaPruning optimization

2022-12-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-5443:
-
Labels: pull-request-available  (was: )

> Fix exception when querying MOR table after applying NestedSchemaPruning 
> optimization
> -
>
> Key: HUDI-5443
> URL: https://issues.apache.org/jira/browse/HUDI-5443
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark, spark-sql
>Affects Versions: 0.12.1
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
>
> This has been discovered while working on HUDI-5384.
> After NestedSchemaPruning has been applied successfully, reading from a MOR 
> table could encounter the following exception when actual delta-log file 
> merging is performed



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] alexeykudinkin opened a new pull request, #7528: [HUDI-5443] Fixing exception trying to read MOR table after NestedSchemaPruning rule has been applied

2022-12-20 Thread GitBox


alexeykudinkin opened a new pull request, #7528:
URL: https://github.com/apache/hudi/pull/7528

   ### Change Logs
   
   Currently, MOR tables with the `NestedSchemaPruning` rule successfully applied (i.e. able to prune the nested schema) fail to read whenever any log-file merging occurs.
   
   TBA
   
   ### Impact
   
   TBA
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[GitHub] [hudi] zhangyue19921010 commented on pull request #7519: [HUDI-5422] Control KEEP_LATEST_VERSIONS clean replaced files immediately or delete after a while

2022-12-20 Thread GitBox


zhangyue19921010 commented on PR #7519:
URL: https://github.com/apache/hudi/pull/7519#issuecomment-1360909416

   > I guess this PR is related with https://github.com/apache/hudi/pull/7405/files; if the clustering metadata files are archived but the replaced files are not cleaned, the query would see duplicates.
   
   Hi @danny0405, I think it is related, but it is not aiming to solve the same issue.
   HUDI-5341 is trying to solve the problem that incremental clean didn't clean all the replaced files as expected, which causes data duplicates.
   
   In this PR, we are trying to add a new control, because `KEEP_LATEST_VERSIONS` deletes all the replaced files immediately, which can cause downstream queries to fail.
   
   Of course, we need to set this time carefully to make sure all replaced files are deleted before archival.
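For context on the knobs being discussed, the cleaner policy and its retention are plain Hudi write configs. A hedged sketch follows: the key names are standard Hudi cleaning configs, but the retention value is an example only and must be tuned so that replaced files outlive the longest-running downstream query, as the comment above cautions.

```java
import java.util.Properties;

class CleanerConfigSketch {
    // Illustrative cleaner settings for the KEEP_LATEST_FILE_VERSIONS policy
    // discussed above; the retained-versions count is an example value.
    static Properties cleanerProps() {
        Properties props = new Properties();
        props.setProperty("hoodie.cleaner.policy", "KEEP_LATEST_FILE_VERSIONS");
        props.setProperty("hoodie.cleaner.fileversions.retained", "3");
        return props;
    }
}
```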





[jira] [Updated] (HUDI-5443) Fix exception when querying MOR table after applying NestedSchemaPruning optimization

2022-12-20 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5443:
--
Status: In Progress  (was: Open)

> Fix exception when querying MOR table after applying NestedSchemaPruning 
> optimization
> -
>
> Key: HUDI-5443
> URL: https://issues.apache.org/jira/browse/HUDI-5443
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark, spark-sql
>Affects Versions: 0.12.1
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>
> This has been discovered while working on HUDI-5384.
> After NestedSchemaPruning has been applied successfully, reading from a MOR 
> table could encounter the following exception when actual delta-log file 
> merging is performed





[jira] [Created] (HUDI-5444) FileNotFound issue w/ metadata enabled

2022-12-20 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-5444:
-

 Summary: FileNotFound issue w/ metadata enabled
 Key: HUDI-5444
 URL: https://issues.apache.org/jira/browse/HUDI-5444
 Project: Apache Hudi
  Issue Type: Bug
  Components: metadata
Reporter: sivabalan narayanan


stacktrace
{code:java}
Caused by: java.io.FileNotFoundException: File not found: gs://TBL_PATH/op_cmpny_cd=WMT.COM/order_placed_dt=2022-12-08/441e7909-6a62-45ac-b9df-dd0386574f52-0_607-17-2433_20221208132316380.parquet
        at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.getFileStatus(GoogleHadoopFileSystemBase.java:1082)
 {code}
 

20221208133227028 (RB_C10)
20221208133227028001 MDT compaction

20221208132316380 (C10)
20221208133647230


DT timeline:

║ 8  │ 20221202234515099 │ rollback │ COMPLETED │ Rolls back 2022120413756     │ 12-02 15:45:18 │ 12-02 15:45:18 │ 12-02 15:45:33 ║
║ 9  │ 20221208133227028 │ rollback │ COMPLETED │ Rolls back 20221208132316380 │ 12-08 05:32:33 │ 12-08 05:32:33 │ 12-08 05:32:44 ║
║ 10 │ 20221208133647230 │ rollback │ COMPLETED │ Rolls back 20221208133222583 │ 12-08 05:36:47 │ 12-08 05:36:48 │ 12-08 05:36:57 ║

MDT timeline:

-rw-r--r--@ 1 nsb  staff     0 Dec  8 05:32 20221208133227028.deltacommit.requested
-rw-r--r--@ 1 nsb  staff   548 Dec  8 05:32 20221208133227028.deltacommit.inflight
-rw-r--r--@ 1 nsb  staff  6042 Dec  8 05:32 20221208133227028.deltacommit
-rw-r--r--@ 1 nsb  staff  1938 Dec  8 05:34 20221208133227028001.compaction.requested
-rw-r--r--@ 1 nsb  staff     0 Dec  8 05:34 20221208133227028001.compaction.inflight
-rw-r--r--@ 1 nsb  staff  7556 Dec  8 05:34 20221208133227028001.commit
-rw-r--r--@ 1 nsb  staff     0 Dec  8 05:34 20221208132316380.deltacommit.requested
-rw-r--r--@ 1 nsb  staff  3049 Dec  8 05:34 20221208132316380.deltacommit.inflight
-rw-r--r--@ 1 nsb  staff  8207 Dec  8 05:35 20221208132316380.deltacommit
-rw-r--r--@ 1 nsb  staff     0 Dec  8 05:36 20221208133647230.deltacommit.requested
-rw-r--r--@ 1 nsb  staff   548 Dec  8 05:36 20221208133647230.deltacommit.inflight
-rw-r--r--@ 1 nsb  staff  6042 Dec  8 05:36 20221208133647230.deltacommit





[jira] [Updated] (HUDI-5444) FileNotFound issue w/ metadata enabled

2022-12-20 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-5444:
--
Priority: Blocker  (was: Major)

> FileNotFound issue w/ metadata enabled
> --
>
> Key: HUDI-5444
> URL: https://issues.apache.org/jira/browse/HUDI-5444
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.13.0
>
>





[jira] [Assigned] (HUDI-5444) FileNotFound issue w/ metadata enabled

2022-12-20 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-5444:
-

Assignee: sivabalan narayanan

> FileNotFound issue w/ metadata enabled
> --
>
> Key: HUDI-5444
> URL: https://issues.apache.org/jira/browse/HUDI-5444
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>





[jira] [Updated] (HUDI-5444) FileNotFound issue w/ metadata enabled

2022-12-20 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-5444:
--
Sprint: 0.13.0 Final Sprint

> FileNotFound issue w/ metadata enabled
> --
>
> Key: HUDI-5444
> URL: https://issues.apache.org/jira/browse/HUDI-5444
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.13.0
>
>





[jira] [Updated] (HUDI-5444) FileNotFound issue w/ metadata enabled

2022-12-20 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-5444:
--
Fix Version/s: 0.13.0

> FileNotFound issue w/ metadata enabled
> --
>
> Key: HUDI-5444
> URL: https://issues.apache.org/jira/browse/HUDI-5444
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.13.0
>
>





[GitHub] [hudi] codope commented on a diff in pull request #7527: [HUDI-5411] Avoid virtual key info for COW table in the input format

2022-12-20 Thread GitBox


codope commented on code in PR #7527:
URL: https://github.com/apache/hudi/pull/7527#discussion_r1054022432


##
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieCopyOnWriteTableInputFormat.java:
##
@@ -247,81 +239,33 @@ private List 
listStatusForSnapshotMode(JobConf job,
   boolean shouldIncludePendingCommits =
      HoodieHiveUtils.shouldIncludePendingCommits(job, tableMetaClient.getTableConfig().getTableName());
 
+  HiveHoodieTableFileIndex fileIndex =
+  new HiveHoodieTableFileIndex(
+  engineContext,
+  tableMetaClient,
+  props,
+  HoodieTableQueryType.SNAPSHOT,
+  partitionPaths,
+  queryCommitInstant,
+  shouldIncludePendingCommits);
+
+  Map<String, List<FileSlice>> partitionedFileSlices = fileIndex.listFileSlices();
+
   // NOTE: Fetching virtual key info is a costly operation as it needs to load the commit metadata.
   //       This is only needed for MOR realtime splits. Hence, for COW tables, this can be avoided.
   Option<HoodieVirtualKeyInfo> virtualKeyInfoOpt = tableMetaClient.getTableType().equals(COPY_ON_WRITE) ? Option.empty() : getHoodieVirtualKeyInfo(tableMetaClient);

Review Comment:
   This is the main change. Earlier it used to be simply `Option<HoodieVirtualKeyInfo> virtualKeyInfoOpt = getHoodieVirtualKeyInfo(tableMetaClient);`.
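   A minimal, self-contained sketch of the short-circuit described above. The enum, method names, and `Optional` here are illustrative stand-ins for Hudi's actual `HoodieTableType`, `Option`, and metadata-loading code, not the real API:

```java
import java.util.Optional;

public class VirtualKeyLookupSketch {
    enum TableType { COPY_ON_WRITE, MERGE_ON_READ }

    // Stand-in for the costly lookup that loads commit metadata.
    static String loadVirtualKeyInfoFromCommitMetadata() {
        return "virtual-key-info";
    }

    // Only MOR (realtime) splits need the virtual key info for schema
    // projection, so COW tables skip the expensive lookup entirely.
    static Optional<String> virtualKeyInfoFor(TableType type) {
        return type == TableType.COPY_ON_WRITE
            ? Optional.empty()
            : Optional.of(loadVirtualKeyInfoFromCommitMetadata());
    }

    public static void main(String[] args) {
        System.out.println(virtualKeyInfoFor(TableType.COPY_ON_WRITE));  // Optional.empty
        System.out.println(virtualKeyInfoFor(TableType.MERGE_ON_READ));  // Optional[virtual-key-info]
    }
}
```

   The expensive call is evaluated only on the MOR branch of the ternary, which is what saves the commit-metadata read for COW listings.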



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] shengchiqu commented on issue #7507: [SUPPORT] how to use flink offline with occ

2022-12-20 Thread GitBox


shengchiqu commented on issue #7507:
URL: https://github.com/apache/hudi/issues/7507#issuecomment-1360897712

   > There is no need for OCC here for offline compaction; all you need to do is start the `HoodieFlinkCompactor` app.
   
   @danny0405 thanks. The Flink SQL job sets 'metadata.enabled' = 'true', but in Flink offline compaction I can't find any metadata properties in org.apache.hudi.sink.compact.FlinkCompactionConfig.
   If offline compaction does not update the metadata table, is the index unavailable?
   
   I tried the Spark offline compaction and it seems to work fine: the ZK lock node is created, and the table's HDFS path .hoodie/metadata/files/ is updated.
   ```shell
   /opt/spark-2.4.5-bin-without-hadoop/bin/spark-submit \
   --master yarn \
   --deploy-mode client \
   --class org.apache.hudi.utilities.HoodieCompactor hudi-utilities-bundle_2.11-0.12.1.jar \
   -sm 2g \
   --mode execute \
   --base-path hdfs://ip:8020/hudi/customer \
   --table-name customer \
   --hoodie-conf hoodie.embed.timeline.server=false \
   --hoodie-conf hoodie.write.concurrency.mode=optimistic_concurrency_control \
   --hoodie-conf hoodie.cleaner.policy.failed.writes=LAZY \
   --hoodie-conf hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider \
   --hoodie-conf hoodie.write.lock.zookeeper.url=ip \
   --hoodie-conf hoodie.write.lock.zookeeper.port=port \
   --hoodie-conf hoodie.write.lock.zookeeper.lock_key=customer \
   --hoodie-conf hoodie.write.lock.zookeeper.base_path=/hudi \
   --hoodie-conf hoodie.metadata.index.bloom.filter.enable=true \
   --hoodie-conf hoodie.metadata.index.column.stats.enable=false \
   --hoodie-conf hoodie.enable.data.skipping=false
   ```





[GitHub] [hudi] danny0405 commented on pull request #7519: [HUDI-5422] Control KEEP_LATEST_VERSIONS clean replaced files immediately or delete after a while

2022-12-20 Thread GitBox


danny0405 commented on PR #7519:
URL: https://github.com/apache/hudi/pull/7519#issuecomment-1360897525

   I guess this PR is related to https://github.com/apache/hudi/pull/7405/files: if the clustering metadata files are archived but the replaced files are not cleaned, the query would see duplicates.





[jira] [Updated] (HUDI-5411) Make sure Trino does not re-instantiate Hive's InputFormat for every partition during file listing

2022-12-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-5411:
-
Labels: pull-request-available  (was: )

> Make sure Trino does not re-instantiate Hive's InputFormat for every 
> partition during file listing
> ---
>
> Key: HUDI-5411
> URL: https://issues.apache.org/jira/browse/HUDI-5411
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: trino-presto
>Reporter: Alexey Kudinkin
>Assignee: Sagar Sumit
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> To unblock 0.12.2, we've implemented a stop-gap falling back to 
> FileSystemView-based listing (HUDI-5409).
> This is not an appropriate long-term solution though, and we need to make 
> sure we fix it properly by avoiding re-instantiating InputFormats w/in Trino 
> itself (so that we can properly use the FileIndex and MT) 





[GitHub] [hudi] codope opened a new pull request, #7527: [HUDI-5411] Avoid virtual key info for COW table in the input format

2022-12-20 Thread GitBox


codope opened a new pull request, #7527:
URL: https://github.com/apache/hudi/pull/7527

   ### Change Logs
   
   Fetching the virtual key info involves reading from commit metadata or a data file (`TableSchemaResolver`), which is a costly operation. It is only needed for schema projection in the case of a MOR table (realtime splits), so we can avoid it for a COW table.
   
   ### Impact
   
   Improves performance of Hive-compatible query engines that depend on the input format implementation in Hudi, e.g. the trino-hive connector. Tested listing on a TPC-DS table with 1824 partitions.
   
   Without this change (1.5 minutes):
   ```
   trino:default> select count(*) from store_sales;
     _col0
   ---------
    2750311
   (1 row)
   
   Query 20221221_054403_3_t63mx, FINISHED, 1 node
   Splits: 1,832 total, 1,832 done (100.00%)
   1:29 [2.75M rows, 28.5MB] [30.8K rows/s, 327KB/s]
   ```
   
   With this change (18 seconds):
   ```
   trino:default> select count(*) from store_sales;
 _col0
   -
2750311
   (1 row)
   
   Query 20221221_055625_2_knx5g, FINISHED, 1 node
   Splits: 1,832 total, 1,832 done (100.00%)
   17.30 [2.75M rows, 28.5MB] [169K rows/s, 1.75MB/s]
   ```
   
   ### Risk level (write none, low medium or high below)
   
   medium
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[jira] [Updated] (HUDI-5022) Add better error messages to pr compliance

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5022:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> Add better error messages to pr compliance
> --
>
> Key: HUDI-5022
> URL: https://issues.apache.org/jira/browse/HUDI-5022
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: code-quality, dev-experience, docs, tests-ci
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> When the pr compliance fails, the messages could be more helpful to users





[jira] [Updated] (HUDI-4970) hudi-kafka-connect-bundle: Could not initialize class org.apache.hadoop.security.UserGroupInformation

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-4970:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> hudi-kafka-connect-bundle: Could not initialize class 
> org.apache.hadoop.security.UserGroupInformation
> -
>
> Key: HUDI-4970
> URL: https://issues.apache.org/jira/browse/HUDI-4970
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> The Kafka connect sink loads successfully but fails to sync the Hudi table 
> due to NoClassDefFoundError: Could not initialize class 
> org.apache.hadoop.security.UserGroupInformation
> {code:java}
> [2022-10-03 14:31:49,872] INFO The value of 
> hoodie.datasource.write.keygenerator.type is empty, using SIMPLE 
> (org.apache.hudi.keygen.factory.HoodieAvroKeyGeneratorFactory:63)[2022-10-03 
> 14:31:49,872] INFO Setting record key volume and partition fields date for 
> table file:///tmp/hoodie/hudi-test-topichudi-test-topic 
> (org.apache.hudi.connect.writers.KafkaConnectTransactionServices:93)[2022-10-03
>  14:31:49,872] INFO Initializing file:///tmp/hoodie/hudi-test-topic as hoodie 
> table file:///tmp/hoodie/hudi-test-topic 
> (org.apache.hudi.common.table.HoodieTableMetaClient:424)[2022-10-03 
> 14:31:49,872] INFO Existing partitions deleted [hudi-test-topic-0] 
> (org.apache.hudi.connect.HoodieSinkTask:156)[2022-10-03 14:31:49,872] ERROR 
> WorkerSinkTask{id=hudi-sink-3} Task threw an uncaught and unrecoverable 
> exception. Task is being killed and will not recover until manually restarted 
> (org.apache.kafka.connect.runtime.WorkerTask:184)java.lang.NoClassDefFoundError:
>  Could not initialize class org.apache.hadoop.security.UserGroupInformation   
> at org.apache.hadoop.fs.FileSystem$Cache$Key.(FileSystem.java:3431) 
>   at 
> org.apache.hadoop.fs.FileSystem$Cache$Key.(FileSystem.java:3421)   
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3263)  at 
> org.apache.hadoop.fs.FileSystem.get(FileSystem.java:475) at 
> org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)   at 
> org.apache.hudi.common.fs.FSUtils.getFs(FSUtils.java:110)at 
> org.apache.hudi.common.fs.FSUtils.getFs(FSUtils.java:103)at 
> org.apache.hudi.common.table.HoodieTableMetaClient.initTableAndGetMetaClient(HoodieTableMetaClient.java:426)
>  at 
> org.apache.hudi.common.table.HoodieTableMetaClient$PropertyBuilder.initTable(HoodieTableMetaClient.java:1110)
> at 
> org.apache.hudi.connect.writers.KafkaConnectTransactionServices.(KafkaConnectTransactionServices.java:104)
>  at 
> org.apache.hudi.connect.transaction.ConnectTransactionCoordinator.(ConnectTransactionCoordinator.java:88)
>   at 
> org.apache.hudi.connect.HoodieSinkTask.bootstrap(HoodieSinkTask.java:191)
> at org.apache.hudi.connect.HoodieSinkTask.open(HoodieSinkTask.java:151) at 
> org.apache.kafka.connect.runtime.WorkerSinkTask.openPartitions(WorkerSinkTask.java:635)
>   at 
> org.apache.kafka.connect.runtime.WorkerSinkTask.access$1000(WorkerSinkTask.java:71){code}
> Follow [https://github.com/apache/hudi/tree/master/hudi-kafka-connect#readme] 
> to reproduce.





[jira] [Updated] (HUDI-5285) Exclude hive-site.xml from packaging in hudi-utilities

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5285:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> Exclude hive-site.xml from packaging in hudi-utilities
> --
>
> Key: HUDI-5285
> URL: https://issues.apache.org/jira/browse/HUDI-5285
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> The Spark cluster can fail to access the external Hive source due to a 
> conflict with the hive-site.xml packaged with Hudi.





[GitHub] [hudi] danny0405 commented on issue #7507: [SUPPORT] how to use flink offline with occ

2022-12-20 Thread GitBox


danny0405 commented on issue #7507:
URL: https://github.com/apache/hudi/issues/7507#issuecomment-1360882363

   There is no need for OCC here for offline compaction; all you need to do is start the `HoodieFlinkCompactor` app.





[GitHub] [hudi] codope opened a new pull request, #7526: Revert "[HUDI-5409] Avoid file index and use fs view cache in COW input format (#7493)"

2022-12-20 Thread GitBox


codope opened a new pull request, #7526:
URL: https://github.com/apache/hudi/pull/7526

   ### Change Logs
   
   This reverts commit cc1c1e7b33d9c95e5a2ba0e9a1db428d1e1b2a00.
   
   ### Impact
   
   Impacts performance of query engines, such as Trino-Hive connector, that 
depend on input format to fetch splits.
   
   ### Risk level (write none, low medium or high below)
   
   medium
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[jira] [Updated] (HUDI-4963) Extend InProcessLockProvider to support multiple table ingestion

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-4963:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> Extend InProcessLockProvider to support multiple table ingestion
> 
>
> Key: HUDI-4963
> URL: https://issues.apache.org/jira/browse/HUDI-4963
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Rajesh Mahindra
>Assignee: Rajesh Mahindra
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>






[jira] [Updated] (HUDI-5404) add flink bundle validation

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5404:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> add flink bundle validation
> ---
>
> Key: HUDI-5404
> URL: https://issues.apache.org/jira/browse/HUDI-5404
> Project: Apache Hudi
>  Issue Type: Test
>  Components: tests-ci
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Make flink bundles validated via GitHub actions CI





[jira] [Updated] (HUDI-4605) Upgrade hudi-presto-bundle version to 0.12.0

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-4605:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> Upgrade hudi-presto-bundle version to 0.12.0
> 
>
> Key: HUDI-4605
> URL: https://issues.apache.org/jira/browse/HUDI-4605
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 0.13.0
>
>






[jira] [Updated] (HUDI-5145) Remove HDFS from DeltaStreamer UT/FT

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5145:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> Remove HDFS from DeltaStreamer UT/FT
> 
>
> Key: HUDI-5145
> URL: https://issues.apache.org/jira/browse/HUDI-5145
> Project: Apache Hudi
>  Issue Type: Test
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>






[jira] [Updated] (HUDI-5131) Bundle validation: upgrade/downgrade

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5131:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> Bundle validation: upgrade/downgrade
> 
>
> Key: HUDI-5131
> URL: https://issues.apache.org/jira/browse/HUDI-5131
> Project: Apache Hudi
>  Issue Type: Test
>  Components: tests-ci
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Blocker
> Fix For: 0.13.0
>
>






[jira] [Updated] (HUDI-5132) Bundle validation: Hive QL 3

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5132:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> Bundle validation: Hive QL 3
> 
>
> Key: HUDI-5132
> URL: https://issues.apache.org/jira/browse/HUDI-5132
> Project: Apache Hudi
>  Issue Type: Test
>  Components: dependencies
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>






[jira] [Updated] (HUDI-5371) Fix flaky testMetadataColumnStatsIndex

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5371:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> Fix flaky testMetadataColumnStatsIndex
> --
>
> Key: HUDI-5371
> URL: https://issues.apache.org/jira/browse/HUDI-5371
> Project: Apache Hudi
>  Issue Type: Test
>Reporter: Sagar Sumit
>Priority: Major
> Fix For: 0.13.0
>
>
> The test started flaking after [https://github.com/apache/hudi/pull/7349]
>  





[jira] [Updated] (HUDI-5099) Update stock data so that new records are added in batch_2

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5099:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> Update stock data so that new records are added in batch_2
> --
>
> Key: HUDI-5099
> URL: https://issues.apache.org/jira/browse/HUDI-5099
> Project: Apache Hudi
>  Issue Type: Test
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> The record key is  "\{stock name}_\{date} \{hour}". We have the data from 
> 9:30-10:29 in batch_1 and batch_2 contains data from 10:30-10:59. This means 
> that no new records are introduced, and therefore, only updates occur when 
> ingesting batch_2. This makes validation of the data take too long for our 
> testing. Proposed solution is to move the data from 10:00-10:29 into batch_2 
> so that we will have updates and inserts in both files





[jira] [Updated] (HUDI-5200) Resources are not cleaned up in UT

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5200:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> Resources are not cleaned up in UT
> --
>
> Key: HUDI-5200
> URL: https://issues.apache.org/jira/browse/HUDI-5200
> Project: Apache Hudi
>  Issue Type: Test
>  Components: tests-ci
>Reporter: zouxxyy
>Assignee: zouxxyy
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Resources are not cleaned up at UT





[jira] [Updated] (HUDI-4209) Avoid using HDFS in HoodieClientTestHarness

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-4209:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> Avoid using HDFS in HoodieClientTestHarness
> ---
>
> Key: HUDI-4209
> URL: https://issues.apache.org/jira/browse/HUDI-4209
> Project: Apache Hudi
>  Issue Type: Test
>  Components: tests-ci
>Reporter: Sagar Sumit
>Assignee: Raymond Xu
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>






[jira] [Updated] (HUDI-4982) Make bundle combination testing covered in CI

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-4982:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> Make bundle combination testing covered in CI
> -
>
> Key: HUDI-4982
> URL: https://issues.apache.org/jira/browse/HUDI-4982
> Project: Apache Hudi
>  Issue Type: Test
>Reporter: Raymond Xu
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> this is to cover 
> - spark-bundle 
> - utilities-bundle
> - utilities-slim-bundle





[jira] [Updated] (HUDI-5098) Enable Spark2.4 bundle testing in GH Actions

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5098:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> Enable Spark2.4 bundle testing in GH Actions
> 
>
> Key: HUDI-5098
> URL: https://issues.apache.org/jira/browse/HUDI-5098
> Project: Apache Hudi
>  Issue Type: Test
>Reporter: Jonathan Vexler
>Priority: Major
> Fix For: 0.13.0
>
>
> Bundle testing works for 3.1, 3.2, and 3.3, but there was a Hive setup issue 
> that wasn't being handled properly. Because we have azure-ci running with 
> 2.4, we decided to resolve this issue in the future.





[jira] [Updated] (HUDI-2673) Add integration/e2e test for kafka-connect functionality

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-2673:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> Add integration/e2e test for kafka-connect functionality
> 
>
> Key: HUDI-2673
> URL: https://issues.apache.org/jira/browse/HUDI-2673
> Project: Apache Hudi
>  Issue Type: Test
>  Components: kafka-connect, tests-ci
>Reporter: Ethan Guo
>Assignee: Raymond Xu
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> The integration test should use bundle jar and run in docker setup.  This can 
> prevent any issue in the bundle, like HUDI-3903, that is not covered by unit 
> and functional tests.





[jira] [Updated] (HUDI-5358) Fix flaky tests in TestCleanerInsertAndCleanByCommits

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5358:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> Fix flaky tests in TestCleanerInsertAndCleanByCommits
> -
>
> Key: HUDI-5358
> URL: https://issues.apache.org/jira/browse/HUDI-5358
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> In the tests, the {{KEEP_LATEST_COMMITS}} cleaner policy is used. This policy 
> first figures out the earliest commit to retain based on the config of the 
> number of retained commits ({{{}hoodie.cleaner.commits.retained{}}}). Then, 
> for each file group, one more version before the earliest commit to retain is 
> also kept from cleaning. The commit for the version can be different among 
> file groups. 
> However, the current validation logic only statically picks the one commit 
> before the earliest commit to retain in the Hudi timeline for all file 
> groups, which does not match the {{KEEP_LATEST_COMMITS}} cleaner policy.
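The per-file-group behavior described above can be sketched as follows. The instant ids, helper names, and the lexicographic comparison are illustrative assumptions for this sketch, not Hudi's actual cleaner code:

```java
import java.util.Arrays;
import java.util.List;

public class KeepLatestCommitsSketch {
    // The earliest commit to retain is derived globally from the number of
    // retained commits (hoodie.cleaner.commits.retained).
    static String earliestCommitToRetain(List<String> timeline, int retained) {
        return timeline.get(Math.max(0, timeline.size() - retained));
    }

    // For one file group, the extra retained version is the latest of ITS
    // commits strictly before the earliest retained commit - not necessarily
    // the commit just before it in the global timeline.
    static String extraRetainedVersion(List<String> fileGroupCommits, String earliestRetained) {
        String extra = null;
        for (String c : fileGroupCommits) {
            if (c.compareTo(earliestRetained) < 0) {
                extra = c; // commits are sorted, so keep the latest qualifying one
            }
        }
        return extra;
    }

    public static void main(String[] args) {
        List<String> timeline = Arrays.asList("c1", "c2", "c3", "c4", "c5");
        String earliest = earliestCommitToRetain(timeline, 2); // "c4"
        // This file group was untouched by c3, so its extra version is c2,
        // one commit further back than the global timeline would suggest.
        System.out.println(extraRetainedVersion(Arrays.asList("c1", "c2", "c4"), earliest));
    }
}
```

This is why a validation that statically picks the single commit before the earliest retained commit in the global timeline can disagree with the cleaner for file groups that skipped that commit.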





[jira] [Updated] (HUDI-5330) Add docs for virtual keys

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5330:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> Add docs for virtual keys
> -
>
> Key: HUDI-5330
> URL: https://issues.apache.org/jira/browse/HUDI-5330
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: docs
>Reporter: Ethan Guo
>Priority: Major
> Fix For: 0.13.0
>
>
> Currently, the virtual key support is only presented in a blog: 
> [https://hudi.apache.org/blog/2021/08/18/virtual-keys/#virtual-key-support.]
>  





[jira] [Updated] (HUDI-5339) Update docs regarding the behavior change in NONE sort mode for bulk insert

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5339:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> Update docs regarding the behavior change in NONE sort mode for bulk insert
> ---
>
> Key: HUDI-5339
> URL: https://issues.apache.org/jira/browse/HUDI-5339
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Priority: Major
> Fix For: 0.13.0
>
>






[jira] [Updated] (HUDI-5295) With multiple meta syncs, one meta sync failure should not impact other meta syncs.

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5295:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> With multiple meta syncs, one meta sync failure should not impact other meta 
> syncs.
> ---
>
> Key: HUDI-5295
> URL: https://issues.apache.org/jira/browse/HUDI-5295
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: deltastreamer, meta-sync, spark-sql
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> For example, if you are using HMS and glue, if HMS sync fails, we should 
> still sync with glue.





[jira] [Updated] (HUDI-5343) HoodieFlinkStreamer supports async clustering for append mode

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5343:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> HoodieFlinkStreamer supports async clustering for append mode
> -
>
> Key: HUDI-5343
> URL: https://issues.apache.org/jira/browse/HUDI-5343
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Nicholas Jiang
>Assignee: Nicholas Jiang
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> HoodieFlinkStreamer supports async clustering for append mode, which keeps it 
> consistent with the pipeline of HoodieTableSink.





[jira] [Updated] (HUDI-5292) Exclude the test resources from every module packaging

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5292:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> Exclude the test resources from every module packaging
> --
>
> Key: HUDI-5292
> URL: https://issues.apache.org/jira/browse/HUDI-5292
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Sagar Sumit
>Priority: Major
> Fix For: 0.13.0
>
>
> Exclude the test resources, especially the properties files that conflict 
> with user-provided resources, from every module. This is a followup to 
> https://github.com/apache/hudi/pull/7310#issuecomment-1328728297





[jira] [Updated] (HUDI-5294) Support type change for schema on read enable + reconcile schema

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5294:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> Support type change for schema on read enable + reconcile schema
> 
>
> Key: HUDI-5294
> URL: https://issues.apache.org/jira/browse/HUDI-5294
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: Tao Meng
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> https://github.com/apache/hudi/issues/7283





[jira] [Updated] (HUDI-5283) Replace deprecated method Schema.parse with Schema.Parser

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5283:
-
Fix Version/s: (was: 0.12.2)

> Replace deprecated method Schema.parse with Schema.Parser
> -
>
> Key: HUDI-5283
> URL: https://issues.apache.org/jira/browse/HUDI-5283
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: cli
>Reporter: Shilun Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> When reading the code, I found that 
> HoodieBootstrapSchemaProvider#getBootstrapSchema uses the deprecated method 
> Schema.parse, which can be replaced by Schema.Parser().parse().
> At the same time, I searched at the module level and found that only this 
> place uses a deprecated method.





[jira] [Updated] (HUDI-5293) Schema on read + reconcile schema fails w/ 0.12.1

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5293:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> Schema on read + reconcile schema fails w/ 0.12.1
> -
>
> Key: HUDI-5293
> URL: https://issues.apache.org/jira/browse/HUDI-5293
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> if I do schema on read on commit1 and then schema on read + reconcile schema 
> for 2nd batch, it fails w/ 
> {code:java}
> warning: there was one deprecation warning; re-run with -deprecation for 
> details
> 22/11/28 16:44:26 ERROR BaseSparkCommitActionExecutor: Error upserting 
> bucketType UPDATE for partition :2
> java.lang.IllegalArgumentException: cannot modify hudi meta col: 
> _hoodie_commit_time
>   at 
> org.apache.hudi.internal.schema.action.TableChange$BaseColumnChange.checkColModifyIsLegal(TableChange.java:157)
>   at 
> org.apache.hudi.internal.schema.action.TableChanges$ColumnAddChange.addColumns(TableChanges.java:314)
>   at 
> org.apache.hudi.internal.schema.utils.AvroSchemaEvolutionUtils.lambda$reconcileSchema$5(AvroSchemaEvolutionUtils.java:92)
>   at 
> java.util.TreeMap$EntrySpliterator.forEachRemaining(TreeMap.java:2969)
>   at 
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580)
>   at 
> org.apache.hudi.internal.schema.utils.AvroSchemaEvolutionUtils.reconcileSchema(AvroSchemaEvolutionUtils.java:80)
>   at 
> org.apache.hudi.table.action.commit.HoodieMergeHelper.runMerge(HoodieMergeHelper.java:103)
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdateInternal(BaseSparkCommitActionExecutor.java:358)
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdate(BaseSparkCommitActionExecutor.java:349)
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:322)
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.lambda$mapPartitionsAsRDD$a3ab3c4$1(BaseSparkCommitActionExecutor.java:244)
>   at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
>   at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:875)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:875)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:359)
>   at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:357)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:357)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:308)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:123)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748) {code}
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (HUDI-5258) Address checkstyle warnings in hudi-common module

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5258:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> Address checkstyle warnings in hudi-common module
> -
>
> Key: HUDI-5258
> URL: https://issues.apache.org/jira/browse/HUDI-5258
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: dev-experience
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5261) Use proper parallelism for engine context APIs

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5261:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> Use proper parallelism for engine context APIs
> --
>
> Key: HUDI-5261
> URL: https://issues.apache.org/jira/browse/HUDI-5261
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: performance
>Reporter: Raymond Xu
>Assignee: Jonathan Vexler
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> do a global search of these APIs
> - org.apache.hudi.common.engine.HoodieEngineContext#flatMap
> - org.apache.hudi.common.engine.HoodieEngineContext#map
> and similar ones take in parallelism.
> Many occurrences use the number of items as the parallelism, which affects
> performance. Parallelism should be based on the number of cores available in
> the cluster and set by the user via parallelism configs.
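The fix described here amounts to capping the engine-context parallelism at a user-configured value instead of using the item count directly. A minimal sketch of that rule, with a hypothetical helper name (this is not Hudi's actual implementation):

```python
def choose_parallelism(num_items: int, configured_parallelism: int) -> int:
    """Pick an engine-context parallelism.

    Passing num_items directly can spawn thousands of tiny tasks; instead,
    cap it at the user-configured parallelism (driven by cluster cores).
    """
    if num_items <= 0:
        return 1
    return max(1, min(num_items, configured_parallelism))
```

For example, listing 10,000 partitions with a configured parallelism of 200 would run 200 tasks, while a 3-item job would still only run 3.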



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5269) Enhancing core user flow tests for spark-sql writes

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5269:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> Enhancing core user flow tests for spark-sql writes
> ---
>
> Key: HUDI-5269
> URL: https://issues.apache.org/jira/browse/HUDI-5269
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql, tests-ci
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> We triaged some of the core user flows and it looks like we don't have good
> coverage of those flows.
>  
>  # COW and MOR (w/ and w/o metadata enabled):
>    Partitioned (BLOOM, SIMPLE, GLOBAL_BLOOM, BUCKET) and
>    non-partitioned (GLOBAL_BLOOM).
>  # Immutable data: pure bulk_insert row writing.
>  # Immutable w/ file sizing: pure inserts.
>  # Initial bulk ingest, followed by updates: bulk_insert followed by
>    upserts.
>  # Regular inserts + updates combined.
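The coverage matrix of table types, metadata settings, and index types above can be enumerated programmatically to drive parametrized tests. An illustrative sketch (the names and values here are placeholders, not Hudi's test harness):

```python
from itertools import product

TABLE_TYPES = ["COPY_ON_WRITE", "MERGE_ON_READ"]
METADATA_ENABLED = [True, False]
PARTITIONED_INDEXES = ["BLOOM", "SIMPLE", "GLOBAL_BLOOM", "BUCKET"]

def coverage_matrix():
    """All (table type, metadata on/off, index) combinations for
    partitioned tables: 2 x 2 x 4 = 16 test cases."""
    return list(product(TABLE_TYPES, METADATA_ENABLED, PARTITIONED_INDEXES))
```

Each tuple would then map to one parametrized test case covering a core user flow.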



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5252) ClusteringCommitSink supports to rollback clustering

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5252:
-
Fix Version/s: (was: 0.12.2)

> ClusteringCommitSink supports to rollback clustering
> 
>
> Key: HUDI-5252
> URL: https://issues.apache.org/jira/browse/HUDI-5252
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: flink
>Reporter: Nicholas Jiang
>Assignee: Nicholas Jiang
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> When commit buffer has failed ClusteringCommitEvent, the ClusteringCommitSink 
> invokes the CompactionUtil#rollbackCompaction to rollback clustering. 
> ClusteringCommitSink should call ClusteringUtil#rollbackClustering to 
> rollback clustering. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5246) Improve validation for partition path

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5246:
-
Fix Version/s: (was: 0.12.2)

> Improve validation for partition path
> -
>
> Key: HUDI-5246
> URL: https://issues.apache.org/jira/browse/HUDI-5246
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Raymond Xu
>Assignee: Hemanth Gowda
>Priority: Minor
>  Labels: hudi-on-call, new-to-hudi, pull-request-available
>
> To fail early if absolute path is set for partition (e.g. with leading `/`)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5241) Optimize HoodieDefaultTimeline API

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5241:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> Optimize HoodieDefaultTimeline API
> --
>
> Key: HUDI-5241
> URL: https://issues.apache.org/jira/browse/HUDI-5241
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: Yann Byron
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5246) Improve validation for partition path

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5246:
-
Fix Version/s: 0.13.0

> Improve validation for partition path
> -
>
> Key: HUDI-5246
> URL: https://issues.apache.org/jira/browse/HUDI-5246
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Raymond Xu
>Assignee: Hemanth Gowda
>Priority: Minor
>  Labels: hudi-on-call, new-to-hudi, pull-request-available
> Fix For: 0.13.0
>
>
> To fail early if absolute path is set for partition (e.g. with leading `/`)
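The early validation suggested here can be sketched as follows (the helper name is hypothetical and not part of Hudi's API):

```python
def validate_partition_path(partition_path: str) -> str:
    """Fail early if an absolute partition path is set (e.g. leading '/')."""
    if partition_path.startswith("/"):
        raise ValueError(
            "Partition path must be relative to the table base path, "
            f"got: {partition_path!r}"
        )
    return partition_path
```

Validating at write-config time surfaces the misconfiguration immediately instead of producing unexpected partition directories later.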



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5198) add in minor perf wins in hudi-utilities and locking related tests

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5198:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> add in minor perf wins in hudi-utilities and locking related tests
> --
>
> Key: HUDI-5198
> URL: https://issues.apache.org/jira/browse/HUDI-5198
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Timothy Brown
>Assignee: Timothy Brown
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5234) Streaming read skip clustering instants Configurable

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5234:
-
Fix Version/s: (was: 0.12.2)

> Streaming read skip clustering instants Configurable
> 
>
> Key: HUDI-5234
> URL: https://issues.apache.org/jira/browse/HUDI-5234
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: clustering
>Reporter: zhuanshenbsj1
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5167) Reduce test run time for virtual key tests

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5167:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> Reduce test run time for virtual key tests
> --
>
> Key: HUDI-5167
> URL: https://issues.apache.org/jira/browse/HUDI-5167
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: tests-ci
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> We parametrized quite a few tests when we added virtual keys. Some of those
> parametrizations may not be required, so let's revisit them and reduce
> wherever applicable.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5181) Enhance keygen class validation

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5181:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> Enhance keygen class validation
> ---
>
> Key: HUDI-5181
> URL: https://issues.apache.org/jira/browse/HUDI-5181
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: configs
>Reporter: Raymond Xu
>Priority: Major
> Fix For: 0.13.0
>
>
> Some in-code validations can be added to alert users early when they set
> keygen configs improperly. For example, in the TimestampBased keygen, the
> output format cannot be empty.
> We should audit all built-in keygen classes and add UTs and proper
> validations. This improves usability and saves troubleshooting time when a
> misconfiguration happens.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5166) Reduce test run time for top time consuming tests

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5166:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> Reduce test run time for top time consuming tests
> -
>
> Key: HUDI-5166
> URL: https://issues.apache.org/jira/browse/HUDI-5166
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: tests-ci
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5178) Add Call show_table_properties for spark sql

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5178:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> Add Call show_table_properties for spark sql
> 
>
> Key: HUDI-5178
> URL: https://issues.apache.org/jira/browse/HUDI-5178
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Forward Xu
>Assignee: Forward Xu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5162) Allow user specified start offset for streaming query

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5162:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> Allow user specified start offset for streaming query
> -
>
> Key: HUDI-5162
> URL: https://issues.apache.org/jira/browse/HUDI-5162
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: reader-core, spark
>Reporter: Hui An
>Assignee: Hui An
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Add a new config, hoodie.datasource.streaming.startOffset, to allow users to
> specify the start offset for a streaming query.
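A sketch of how such a start-offset option could be resolved. The semantics assumed below ("earliest", "latest", or an explicit commit instant, defaulting to latest) are an illustration; the exact resolution rules are defined by the PR, not here:

```python
def resolve_start_offset(user_offset, earliest_instant, latest_instant):
    """Resolve the streaming start instant from the user-supplied option.

    user_offset is the (assumed) value of
    hoodie.datasource.streaming.startOffset: None, 'earliest', 'latest',
    or a commit instant string such as '20221220000000'.
    """
    if user_offset is None or user_offset == "latest":
        return latest_instant        # default: start from the newest commit
    if user_offset == "earliest":
        return earliest_instant      # replay the whole timeline
    return user_offset               # explicit commit instant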



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5112) Add presto query validation support for all tests in integ tests

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5112:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> Add presto query validation support for all tests in integ tests
> 
>
> Key: HUDI-5112
> URL: https://issues.apache.org/jira/browse/HUDI-5112
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: tests-ci
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5113) Add support to test different indexes with integ test

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5113:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> Add support to test different indexes with integ test
> -
>
> Key: HUDI-5113
> URL: https://issues.apache.org/jira/browse/HUDI-5113
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: tests-ci
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5060) Make all clean policies support incremental mode to find partition paths

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5060:
-
Fix Version/s: (was: 0.12.2)

> Make all clean policies support incremental mode to find partition paths
> 
>
> Key: HUDI-5060
> URL: https://issues.apache.org/jira/browse/HUDI-5060
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: cleaning
>Reporter: 董可伦
>Assignee: 董可伦
>Priority: Major
>  Labels: pull-request-available
>
> Make all clean policies support incremental mode to find partition paths



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5072) Extract transform duplicate code

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5072:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> Extract transform duplicate code
> 
>
> Key: HUDI-5072
> URL: https://issues.apache.org/jira/browse/HUDI-5072
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: cli
>Reporter: Shilun Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> When reading the code, I found that the transform methods of
> MultipleSparkJobExecutionStrategy and SingleSparkJobExecutionStrategy contain
> duplicated code. I think we can extract it to make the code cleaner.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5052) Update 0.12.0 docs for regression

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5052:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> Update 0.12.0 docs for regression
> -
>
> Key: HUDI-5052
> URL: https://issues.apache.org/jira/browse/HUDI-5052
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: docs
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5051) Add a functional regression test for Bloom Index followed on w/ Upserts

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5051:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> Add a functional regression test for Bloom Index followed on w/ Upserts
> ---
>
> Key: HUDI-5051
> URL: https://issues.apache.org/jira/browse/HUDI-5051
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: tests-ci
>Reporter: Alexey Kudinkin
>Assignee: Jonathan Vexler
>Priority: Blocker
> Fix For: 0.13.0
>
>
> In the test
>  * State is initially bootstrapped by Bulk Insert (row-writing)
>  * Follow-up w/ upserts



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5035) Remove deprecated API usage in SparkPreCommitValidator#validate

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5035:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> Remove deprecated API usage in SparkPreCommitValidator#validate
> ---
>
> Key: HUDI-5035
> URL: https://issues.apache.org/jira/browse/HUDI-5035
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: cli
>Reporter: Shilun Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
> Attachments: image-2022-10-15-07-23-43-689.png
>
>
> I found that the code uses a deprecated API and modified it to use the
> recommended API instead.
>  
> !image-2022-10-15-07-23-43-689.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5032) Add Archiving to the CLI

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5032:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> Add Archiving to the CLI
> 
>
> Key: HUDI-5032
> URL: https://issues.apache.org/jira/browse/HUDI-5032
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: archiving, cli
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4990) Parallelize deduplication in CLI tool

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-4990:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> Parallelize deduplication in CLI tool
> -
>
> Key: HUDI-4990
> URL: https://issues.apache.org/jira/browse/HUDI-4990
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: sivabalan narayanan
>Priority: Minor
> Fix For: 0.13.0
>
>
> The CLI tool command `repair deduplicate` repairs one partition at a time. To
> repair hundreds of partitions, this takes a long time. We should add a mode
> that takes multiple partition paths for the CLI and runs the dedup job for
> multiple partitions at the same time.
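The proposed multi-partition mode could be sketched with a thread pool; the repair function below is a stand-in, not the actual CLI internals:

```python
from concurrent.futures import ThreadPoolExecutor

def dedup_partitions(partition_paths, repair_fn, max_workers=4):
    """Run the dedup/repair job for many partitions concurrently.

    repair_fn is called once per partition path; results are returned as a
    dict keyed by partition path (pool.map preserves input order).
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(partition_paths, pool.map(repair_fn, partition_paths)))
```

In the real CLI, each repair would submit a Spark job, so the worker count should be bounded by cluster capacity rather than the partition count.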



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5018) Make user-provided copyOnWriteRecordSizeEstimate first precedence

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5018:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> Make user-provided copyOnWriteRecordSizeEstimate first precedence
> -
>
> Key: HUDI-5018
> URL: https://issues.apache.org/jira/browse/HUDI-5018
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: Raymond Xu
>Assignee: xi chaomin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> For estimated avg record size
> https://hudi.apache.org/docs/configurations/#hoodiecopyonwriterecordsizeestimate
> which is used here
> https://github.com/apache/hudi/blob/86a1efbff1300603a8180111eae117c7f9dbd8a5/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java#L372
> Propose to respect the user's setting by following the precedence below:
> 1) if the user sets a value, use it as is
> 2) if the user does not set it, infer it from timeline commit metadata
> 3) if the timeline is empty, use a default (currently 1024)
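The three-step precedence can be sketched as a small resolver (hypothetical helper, simplified from the UpsertPartitioner logic linked above):

```python
DEFAULT_RECORD_SIZE_ESTIMATE = 1024  # bytes; current fallback per the ticket

def record_size_estimate(user_value, timeline_avg):
    """Resolve the avg record size estimate, user value first.

    1) a user-provided hoodie.copyonwrite.record.size.estimate, if set
    2) otherwise, the average inferred from timeline commit metadata
    3) otherwise, the default
    """
    if user_value is not None:
        return user_value
    if timeline_avg is not None:
        return timeline_avg
    return DEFAULT_RECORD_SIZE_ESTIMATE
```

This ordering guarantees an explicit user setting is never silently overridden by the timeline-derived estimate.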



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4967) Improve docs for meta sync with TimestampBasedKeyGenerator

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-4967:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> Improve docs for meta sync with TimestampBasedKeyGenerator
> --
>
> Key: HUDI-4967
> URL: https://issues.apache.org/jira/browse/HUDI-4967
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Related fix: HUDI-4966
> We need to add docs on how to properly set the meta sync configuration, 
> especially the hoodie.datasource.hive_sync.partition_value_extractor, in 
> [https://hudi.apache.org/docs/key_generation] (for different Hudi versions, 
> the config can be different).  Check the ticket above and PR description of 
> [https://github.com/apache/hudi/pull/6851] for more details.
> We should also add the migration setup on the key generation page as well: 
> [https://hudi.apache.org/releases/release-0.12.0/#configuration-updates]
>  * {{{}hoodie.datasource.hive_sync.partition_value_extractor{}}}: This config 
> is used to extract and transform partition value during Hive sync. Its 
> default value has been changed from 
> {{SlashEncodedDayPartitionValueExtractor}} to 
> {{{}MultiPartKeysValueExtractor{}}}. If you relied on the previous default 
> value (i.e., have not set it explicitly), you are required to set the config 
> to {{{}org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor{}}}. From 
> this release, if this config is not set and Hive sync is enabled, then 
> partition value extractor class will be *automatically inferred* on the basis 
> of number of partition fields and whether or not hive style partitioning is 
> enabled.
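The automatic inference described in the release note can be illustrated with a simplified decision rule. The class names mirror those in the note above, but the actual inference Hudi performs may differ in detail:

```python
def infer_partition_value_extractor(num_partition_fields: int,
                                    hive_style: bool) -> str:
    """Illustrative inference of the Hive sync partition value extractor,
    based on the number of partition fields and hive-style partitioning."""
    if num_partition_fields == 0:
        return "org.apache.hudi.hive.NonPartitionedExtractor"
    if hive_style:
        return "org.apache.hudi.hive.HiveStylePartitionValueExtractor"
    return "org.apache.hudi.hive.MultiPartKeysValueExtractor"
```

Users who relied on the old default must still set `SlashEncodedDayPartitionValueExtractor` explicitly, since the inference cannot guess a date-encoding scheme.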



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4888) Add validation to block COW table to use consistent hashing bucket index

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-4888:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> Add validation to block COW table to use consistent hashing bucket index
> 
>
> Key: HUDI-4888
> URL: https://issues.apache.org/jira/browse/HUDI-4888
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Yuwei Xiao
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Consistent hashing bucket index resizing relies on the log feature of the MOR
> table, so with a COW table the consistent hashing bucket index cannot
> currently be resized.
> We should block the user from using it at the very beginning (i.e., table
> creation), and suggest they use a MOR table or the simple bucket index
> instead.
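A sketch of the table-creation validation (the function and enum strings are illustrative, not Hudi's actual config keys):

```python
def validate_index_config(table_type: str, index_type: str) -> None:
    """Block consistent hashing bucket index on COW tables at creation time."""
    if table_type == "COPY_ON_WRITE" and index_type == "BUCKET_CONSISTENT_HASHING":
        raise ValueError(
            "Consistent hashing bucket index relies on MOR log files for "
            "resizing; use a MERGE_ON_READ table or the simple bucket index."
        )
```

Failing at creation time avoids a table that silently cannot rebalance later.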



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4881) Push down filters if possible when syncing partitions to Hive

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-4881:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> Push down filters if possible when syncing partitions to Hive
> -
>
> Key: HUDI-4881
> URL: https://issues.apache.org/jira/browse/HUDI-4881
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: hive, meta-sync
>Reporter: Hui An
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4839) rocksdbjni is not compatible with apple silicon

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-4839:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> rocksdbjni is not compatible with apple silicon
> ---
>
> Key: HUDI-4839
> URL: https://issues.apache.org/jira/browse/HUDI-4839
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: zouxxyy
>Assignee: zouxxyy
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> rocksdbjni 5.17.2 is not compatible with apple silicon
> When FileSystemViewStorageType.EMBEDDED_KV_STORE is set on an Apple M1
> machine, an error like this is raised:
> {code:java}
> java.lang.UnsatisfiedLinkError: 
> /private/var/folders/px/y3gybll50ggctcjp2t4r2b50gp/T/librocksdbjni1847223031371241574.jnilib:
>  
> dlopen(/private/var/folders/px/y3gybll50ggctcjp2t4r2b50gp/T/librocksdbjni1847223031371241574.jnilib,
>  0x0001): tried: 
> '/private/var/folders/px/y3gybll50ggctcjp2t4r2b50gp/T/librocksdbjni1847223031371241574.jnilib'
>  (mach-o file, but is an incompatible architecture (have 'x86_64', need 
> 'arm64e')) {code}
> After 6.29.4.1, RocksDB can work on M1 Macs; see
> [here|https://github.com/facebook/rocksdb/issues/7720]
>  
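If the dependency is bumped as suggested, the POM change would look roughly like this. The version shown is the first M1-compatible line mentioned above; the version actually chosen for Hudi may differ:

```xml
<dependency>
  <groupId>org.rocksdb</groupId>
  <artifactId>rocksdbjni</artifactId>
  <!-- 6.29.4.1+ ships an arm64 jnilib alongside x86_64 -->
  <version>6.29.4.1</version>
</dependency>
```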



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4823) Add read_optimize spark_session config to use in spark-sql

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-4823:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> Add read_optimize spark_session config to use in spark-sql
> --
>
> Key: HUDI-4823
> URL: https://issues.apache.org/jira/browse/HUDI-4823
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: yonghua jian
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> When a table is created without using the Hive catalog in Spark, we cannot
> easily do a read-optimized query in spark-sql (using the global Hudi config
> file is inconvenient), so I added the read_optimize spark_session config for
> use in spark-sql.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2913) Disable auto clean in writer task

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-2913:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> Disable auto clean in writer task
> -
>
> Key: HUDI-2913
> URL: https://issues.apache.org/jira/browse/HUDI-2913
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: flink
>Reporter: Zhaojing Yu
>Assignee: Zhaojing Yu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3954) Don't keep the last commit before the earliest commit to retain

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-3954:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> Don't keep the last commit before the earliest commit to retain
> ---
>
> Key: HUDI-3954
> URL: https://issues.apache.org/jira/browse/HUDI-3954
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: cleaning
>Reporter: 董可伦
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Don't keep the last commit before the earliest commit to retain
> According to the document of {{{}hoodie.cleaner.commits.retained{}}}:
> Number of commits to retain, without cleaning. This will be retained for 
> num_of_commits * time_between_commits (scheduled). This also directly 
> translates into how much data retention the table supports for incremental 
> queries.
>  
> We only need to keep the number of commits configured through the parameter
> {{hoodie.cleaner.commits.retained}}, and the commits retained by clean are
> completed. This ensures that “This will be retained for num_of_commits *
> time_between_commits” from the document holds.
> So we don't need to keep the last commit before the earliest commit to
> retain. If we want to keep more versions, we can increase the parameter
> {{hoodie.cleaner.commits.retained}}.
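The retention rule argued for above can be sketched as follows. This is an illustrative helper, not Hudi's actual CleanPlanner code; the method name and the oldest-first commit-list representation are assumptions for the sketch.

```java
import java.util.List;

// Hypothetical sketch (not Hudi's CleanPlanner): with
// hoodie.cleaner.commits.retained = N, the earliest commit to retain is the
// N-th latest completed commit; everything strictly older is eligible for
// cleaning, with no extra "last commit before the earliest retained" kept.
public class CleanRetentionSketch {
    // commits must be sorted oldest-first; returns the earliest instant to
    // keep, or null when there are no commits at all
    public static String earliestCommitToRetain(List<String> commits, int retained) {
        if (commits.isEmpty()) {
            return null;
        }
        if (commits.size() <= retained) {
            return commits.get(0); // nothing to clean yet, keep everything
        }
        return commits.get(commits.size() - retained);
    }
}
```

With four commits and `retained = 2`, only the two latest are kept, so the earliest commit to retain is the third one.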



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-712) Improve exporter performance and memory usage

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-712:

Fix Version/s: 0.13.0
   (was: 0.12.2)

> Improve exporter performance and memory usage
> -
>
> Key: HUDI-712
> URL: https://issues.apache.org/jira/browse/HUDI-712
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Utilities
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> [https://github.com/apache/incubator-hudi/blob/99b7e9eb9ef8827c1e06b7e8621b6be6403b061e/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieSnapshotExporter.java#L103-L107]
> The way the data file list for export is collected can be improved because:
>  * it is not parallelized among partitions
>  * the list can be too large
>  * listing partitions to get the latest files requires scanning all files 
> (RFC-15 could solve this)
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1570) Add Avg record size in commit metadata

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-1570:
-
Fix Version/s: (was: 0.12.2)

> Add Avg record size in commit metadata
> --
>
> Key: HUDI-1570
> URL: https://issues.apache.org/jira/browse/HUDI-1570
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Utilities
>Reporter: sivabalan narayanan
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
> Attachments: Screen Shot 2021-01-31 at 7.05.55 PM.png
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> Many users want to understand what would be their avg record size in hudi 
> storage. They need this so that they can deduce their bloom config values. 
>  As of now, there is no easy way for the end user to fetch the record size. 
> Even w/ hudi-cli, we could decipher it from commit metadata, but that needs 
> some rough calculation. So, it would be better if we store the avg record size w/ 
> WriteStats (total bytes written / total records written), as well as in 
> commit metadata. So, in hudi_cli, we could expose this info along w/ "commit 
> showpartitions" or expose another command "commit showmetadata" or something. 
> As of now, we could calculate the avg size from bytes written/records written 
> from commit metadata. 
> !Screen Shot 2021-01-31 at 7.05.55 PM.png!
>  
>  
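The back-of-the-envelope calculation the issue wants surfaced directly is simple enough to sketch. The class and field names below are illustrative, not Hudi's actual HoodieWriteStat API.

```java
// Hypothetical sketch of the avg-record-size computation proposed above:
// derive it from a commit's write stats (bytes written / records written)
// instead of making the user do the arithmetic from commit metadata.
public class AvgRecordSizeSketch {
    public static long avgRecordSize(long totalBytesWritten, long totalRecordsWritten) {
        if (totalRecordsWritten == 0) {
            return 0L; // avoid division by zero for empty commits
        }
        return totalBytesWritten / totalRecordsWritten;
    }
}
```

For example, 1 MiB written across 1024 records gives an average record size of 1024 bytes, which is the kind of number a user would feed into bloom filter sizing.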



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5105) Add Call show_commit_extra_metadata for spark sql

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5105:
-
Fix Version/s: 0.13.0

> Add Call show_commit_extra_metadata for spark sql
> -
>
> Key: HUDI-5105
> URL: https://issues.apache.org/jira/browse/HUDI-5105
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: spark-sql
>Reporter: Forward Xu
>Assignee: Forward Xu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5201) add totalRecordsDeleted metric

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5201:
-
Fix Version/s: (was: 0.12.2)

> add totalRecordsDeleted metric
> --
>
> Key: HUDI-5201
> URL: https://issues.apache.org/jira/browse/HUDI-5201
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: metrics
>Reporter: Hussein Awala
>Assignee: Hussein Awala
>Priority: Major
>  Labels: pull-request-available
>
> Add missing {{totalRecordsDeleted}} metric to commit action metrics.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5105) Add Call show_commit_extra_metadata for spark sql

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5105:
-
Fix Version/s: (was: 0.12.2)

> Add Call show_commit_extra_metadata for spark sql
> -
>
> Key: HUDI-5105
> URL: https://issues.apache.org/jira/browse/HUDI-5105
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: spark-sql
>Reporter: Forward Xu
>Assignee: Forward Xu
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5059) Support automatic setting of certain attributes when creating a table in the flink catalog

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5059:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> Support automatic setting of certain attributes when creating a table in the 
> flink catalog
> --
>
> Key: HUDI-5059
> URL: https://issues.apache.org/jira/browse/HUDI-5059
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: flink-sql
>Reporter: waywtdcc
>Priority: Major
> Fix For: 0.13.0
>
>
> Support the automatic setting of certain attributes when creating a table in 
> the flink catalog. For example, when creating a hudi catalog, apply some 
> default attributes, such as the number of write.tasks, and automatically 
> attach these attributes when creating tables to reduce development workload.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5048) add CopyToTempView support

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5048:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> add CopyToTempView support
> --
>
> Key: HUDI-5048
> URL: https://issues.apache.org/jira/browse/HUDI-5048
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: scx
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Previously, when using spark sql, we didn't have a good way to incrementally 
> read or time travel the hudi table. So I added the CopyToTempView procedure. 
> It registers the hudi table as a spark temporary view, and data developers 
> can directly query the view with different read modes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4809) Hudi Support AWS Glue DropPartitions

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-4809:
-
Fix Version/s: 0.13.0

> Hudi Support AWS Glue DropPartitions 
> -
>
> Key: HUDI-4809
> URL: https://issues.apache.org/jira/browse/HUDI-4809
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: metadata
>Reporter: XixiHua
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4809) Hudi Support AWS Glue DropPartitions

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-4809:
-
Fix Version/s: (was: 0.12.2)

> Hudi Support AWS Glue DropPartitions 
> -
>
> Key: HUDI-4809
> URL: https://issues.apache.org/jira/browse/HUDI-4809
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: metadata
>Reporter: XixiHua
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5168) Flink metrics integration

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5168:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> Flink metrics integration
> -
>
> Key: HUDI-5168
> URL: https://issues.apache.org/jira/browse/HUDI-5168
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: flink, flink-sql
>Reporter: Zhaojing Yu
>Priority: Major
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5334) Get checkpoint from non-completed instant

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5334:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> Get checkpoint from non-completed instant
> -
>
> Key: HUDI-5334
> URL: https://issues.apache.org/jira/browse/HUDI-5334
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Original issue https://github.com/apache/hudi/issues/7375



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5318) Clustering scheduling now will list all partitions in table when PARTITION_SELECTED is set

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5318:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> Clustering scheduling now will list all partitions in table when 
> PARTITION_SELECTED is set
> 
>
> Key: HUDI-5318
> URL: https://issues.apache.org/jira/browse/HUDI-5318
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: clustering
>Reporter: Qijun Fu
>Assignee: Qijun Fu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Currently PartitionAwareClusteringPlanStrategy lists all partitions in the 
> table whether PARTITION_SELECTED is set or not. Listing all partitions in 
> the dataset is a very expensive operation when the number of partitions is 
> huge. We can skip listing all partitions when PARTITION_SELECTED is set, so 
> that clustering scheduling can benefit a lot from partition pruning.
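The pruning the issue asks for amounts to a short-circuit before the expensive listing. The sketch below is hypothetical; the method shape and the way the selected partitions are passed in are not Hudi's actual PartitionAwareClusteringPlanStrategy API.

```java
import java.util.List;
import java.util.function.Supplier;

// Illustrative sketch of the proposed pruning: when a partition selection is
// configured (PARTITION_SELECTED), use it directly and never pay for the
// full partition listing; fall back to listing only when nothing is selected.
public class ClusteringPartitionSketch {
    public static List<String> partitionsToCluster(List<String> selectedPartitions,
                                                   Supplier<List<String>> listAllPartitions) {
        if (selectedPartitions != null && !selectedPartitions.isEmpty()) {
            return selectedPartitions;      // cheap: skip the full listing
        }
        return listAllPartitions.get();     // expensive fallback
    }
}
```

Passing the listing as a `Supplier` keeps it lazy, so tables with huge partition counts never touch storage when an explicit selection is present.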



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5229) Add flink avro version entry in root pom

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5229:
-
Fix Version/s: (was: 0.12.2)

> Add flink avro version entry in root pom
> 
>
> Key: HUDI-5229
> URL: https://issues.apache.org/jira/browse/HUDI-5229
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink
>Reporter: Danny Chen
>Priority: Major
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5220) failed to snapshot query in hive when querying an empty partition

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5220:
-
Fix Version/s: (was: 0.12.2)

> failed to snapshot query in hive when querying an empty partition 
> --
>
> Key: HUDI-5220
> URL: https://issues.apache.org/jira/browse/HUDI-5220
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: hive
>Reporter: yuehanwang
>Priority: Major
>  Labels: pull-request-available
>
> When querying an empty partition, hive will return an empty file in the 
> split path. This path will be added as a NonHoodieInputPath. In this case 
> HoodieParquetRealtimeInputFormat reads a file split rather than a 
> RealtimeSplit and throws an exception:
> HoodieRealtimeRecordReader can only work on RealtimeSplit and not with 
> hdfs://test-cluster/tmp/hive/20220520/hive/4273589d-49be-4a60-9890-a29660d81927/hive_2022-11-14_11-32-41_221_5694963332005566615-17/-mr-10004/74adf5bb-b07e-4eac-a90b-1b5a7fc3d5c4/emptyFile:0+466



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5270) Duplicate key error when insert_overwrite same partition in multi writer

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5270:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> Duplicate key error when insert_overwrite same partition in multi writer
> 
>
> Key: HUDI-5270
> URL: https://issues.apache.org/jira/browse/HUDI-5270
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: multi-writer, spark-sql
>Affects Versions: 0.11.0
>Reporter: weiming
>Assignee: weiming
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> If OCC is enabled for a hudi spark table and multiple threads 
> insert_overwrite the same partition, the data of the later task should 
> overwrite the data of the previous task. However, an error occurs.
> {code:java}
> // execute sql insert overwrite same partition
> ##THREAD-1 EXECUTE SQL
> insert overwrite table hudi_test_wm1_mor_02 partition (dt = '2021-12-14',hh = 
> '6') select id,name,price,ts from hudi_test_wm1_mor_01 where dt='2021-12-11' 
> and hh ='2';
> ##THREAD-2 EXECUTE SQL
> insert overwrite table hudi_test_wm1_mor_02 partition (dt = '2021-12-14',hh = 
> '6') select id,name,price,ts from hudi_test_wm1_mor_01 where dt='2021-12-11' 
> and hh ='4'; {code}
> {code:java}
> // ERROR LOG
> 22/11/07 15:24:53 ERROR SparkSQLDriver: Failed in [insert overwrite table 
> hudi_test_wm1_mor_02 partition (dt = '2021-12-14',hh = '6') select 
> id,name,price,ts from hudi_test_wm1_mor_01 where dt='2021-12-11' and hh 
> ='4']java.lang.IllegalStateException: Duplicate key 
> [20221107152403967__replacecommit__COMPLETED]at 
> java.util.stream.Collectors.lambda$throwingMerger$0(Collectors.java:133)
> at java.util.HashMap.merge(HashMap.java:1245)at 
> java.util.stream.Collectors.lambda$toMap$58(Collectors.java:1320)at 
> java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169)at 
> java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)at 
> java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)at 
> java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
> at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374)
> at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)  
>   at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)  
>   at 
> java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151) 
>    at 
> java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
> at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)  
>   at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)   
>  at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:270) 
>    at java.util.Iterator.forEachRemaining(Iterator.java:116)at 
> java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
> at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)  
>   at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)  
>   at 
> java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151) 
>    at 
> java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
> at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)  
>   at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)   
>  at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:270) 
>    at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374)
> at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)  
>   at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)  
>   at 
> java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
> at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
> at 
> org.apache.hudi.common.table.view.AbstractTableFileSystemView.resetFileGroupsReplaced(AbstractTableFileSystemView.java:244)
> at 
> org.apache.hudi.common.table.view.AbstractTableFileSystemView.init(AbstractTableFileSystemView.java:108)
> at 
> org.apache.hudi.common.table.view.HoodieTableFileSystemView.init(HoodieTableFileSystemView.java:108)
> at 
> org.apache.hudi.common.table.view.HoodieTableFileSystemView.(HoodieTableFileSystemView.java:102)
> at 
> org.apache.hudi.common.table.view.HoodieTableFileSystemView.(HoodieTableFileSystemView.java:93)
> at 
> 
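The `Duplicate key` failure in the stack trace comes from `Collectors.toMap`, whose default merger throws `IllegalStateException` when two stream elements map to the same key. A minimal reproduction, unrelated to Hudi's actual `resetFileGroupsReplaced` internals, is:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Minimal reproduction of the failure mode in the trace above: the two-arg
// Collectors.toMap uses a throwing merger, so a duplicate key (here, two
// replacecommits producing the same entry) raises IllegalStateException;
// supplying an explicit merge function avoids the exception.
public class DuplicateKeyDemo {
    public static Map<String, String> collect(List<String[]> pairs, boolean keepLatest) {
        if (keepLatest) {
            // explicit merge function: last value wins, no exception
            return pairs.stream().collect(
                Collectors.toMap(p -> p[0], p -> p[1], (first, second) -> second));
        }
        // default throwing merger: IllegalStateException("Duplicate key ...")
        return pairs.stream().collect(Collectors.toMap(p -> p[0], p -> p[1]));
    }
}
```

Whether "last value wins" is the right resolution here is a design question for the fix; the demo only shows where the exception originates.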

[jira] [Updated] (HUDI-5174) Clustering w/ two multi-writers could lead to issues

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5174:
-
Fix Version/s: (was: 0.12.2)

> Clustering w/ two multi-writers could lead to issues
> 
>
> Key: HUDI-5174
> URL: https://issues.apache.org/jira/browse/HUDI-5174
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: clustering, table-service
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Critical
>
> If two writers have clustering enabled, each could roll back the clustering 
> that the other writer is currently executing, which could lead to 
> unrecoverable issues. 
>  
>  
> {code:java}
>  t1   t2
> ➝ t
>  writer1 |-| 
>  writer2 |--|{code}
> Let's say writer1 starts a clustering at t1, and then writer2 starts 
> clustering at t2. At that point, writer2 will roll back the clustering 
> started at t1, but writer1 could still be executing it. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5177) Revisit HiveIncrPullSource and JdbcSource for interleaved inflight commits

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5177:
-
Fix Version/s: (was: 0.12.2)

> Revisit HiveIncrPullSource and JdbcSource for interleaved inflight commits
> --
>
> Key: HUDI-5177
> URL: https://issues.apache.org/jira/browse/HUDI-5177
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Jonathan Vexler
>Priority: Critical
>
> HUDI-5176
> We have fixed the Hudi incremental source when there are inflight commits 
> before completed commits.  We need to revisit the logic for 
> HiveIncrPullSource and JdbcSource as well regarding the same scenario.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5171) Ensure validateTableConfig also checks for partition path field value switch

2022-12-20 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-5171:
-
Fix Version/s: 0.13.0
   (was: 0.12.2)

> Ensure validateTableConfig also checks for partition path field value switch
> 
>
> Key: HUDI-5171
> URL: https://issues.apache.org/jira/browse/HUDI-5171
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Affects Versions: 0.12.1
>Reporter: sivabalan narayanan
>Assignee: Jonathan Vexler
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> As of now, validateTableConfig does not consider a switch of the partition 
> path field value. We need to consider that as well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5443) Fix exception when querying MOR table after applying NestedSchemaPruning optimization

2022-12-20 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5443:
--
Sprint: 0.13.0 Final Sprint

> Fix exception when querying MOR table after applying NestedSchemaPruning 
> optimization
> -
>
> Key: HUDI-5443
> URL: https://issues.apache.org/jira/browse/HUDI-5443
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark, spark-sql
>Affects Versions: 0.12.1
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>
> This has been discovered while working on HUDI-5384.
> After NestedSchemaPruning has been applied successfully, reading from a MOR 
> table could encounter the following exception when the actual delta-log 
> file merging is performed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

