[GitHub] [hudi] danny0405 commented on issue #8148: [SUPPORT]

2023-03-09 Thread via GitHub


danny0405 commented on issue #8148:
URL: https://github.com/apache/hudi/issues/8148#issuecomment-1463408930

   That's a nice analysis @kkrugler, let's see if we can solve this in an elegant way!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-5917) MOR table log file has only one replication

2023-03-09 Thread sandy du (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandy du updated HUDI-5917:
---
Description: When mor table enable  HoodieRetryWrapperFileSystem through 
the configuration  `hoodie.filesystem.operation.retry.enable=true` ,log file in 
hdfs only has one replication.  (was: When mor talbe enable  
HoodieRetryWrapperFileSystem through the configuration  
`hoodie.filesystem.operation.retry.enable=true` ,log file in hdfs only has one 
replication.)

> MOR table log file has only one replication
> ---
>
> Key: HUDI-5917
> URL: https://issues.apache.org/jira/browse/HUDI-5917
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: sandy du
>Priority: Major
>  Labels: pull-request-available
>
> When a MOR table enables HoodieRetryWrapperFileSystem through the configuration
> `hoodie.filesystem.operation.retry.enable=true`, the log file in HDFS has only
> one replication.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5917) MOR table log file has only one replication

2023-03-09 Thread sandy du (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandy du updated HUDI-5917:
---
Summary: MOR table log file has only one replication  (was: MOR Table Log 
file has only one replication)

> MOR table log file has only one replication
> ---
>
> Key: HUDI-5917
> URL: https://issues.apache.org/jira/browse/HUDI-5917
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: sandy du
>Priority: Major
>  Labels: pull-request-available
>
> When a MOR table enables HoodieRetryWrapperFileSystem through the configuration
> `hoodie.filesystem.operation.retry.enable=true`, the log file in HDFS has only
> one replication.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5740) Refactor Deltastreamer and schema providers to use HoodieConfig/ConfigProperty

2023-03-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-5740:
-
Labels: pull-request-available  (was: )

> Refactor Deltastreamer and schema providers to use HoodieConfig/ConfigProperty
> --
>
> Key: HUDI-5740
> URL: https://issues.apache.org/jira/browse/HUDI-5740
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: configs, deltastreamer
>Reporter: Jonathan Vexler
>Assignee: Lokesh Jain
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> The configs in the following classes are not implemented using HoodieConfig, which makes
> it impossible to surface them on the Configurations page.  We need to refactor the code
> so that each config property is implemented using ConfigProperty in a corresponding new
> HoodieConfig class.  Refer to HoodieArchivalConfig for an existing implementation of configs.
>  
> InitialCheckPointProvider
> HoodieDeltaStreamer
> HoodieMultiTableDeltaStreamer
> FilebasedSchemaProvider
> HiveSchemaProvider
> JdbcbasedSchemaProvider
> ProtoClassBasedSchemaProvider
> SchemaPostProcessor
> SchemaRegistryProvider
> SparkAvroPostProcessor
> DropColumnSchemaPostProcessor
> BaseSchemaPostProcessorConfig
> KafkaOffsetPostProcessor
> SanitizationUtils
> Also 'hoodie.deltastreamer.multiwriter.source.checkpoint.id' in 
> HoodieWriteConfig



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] lokeshj1703 opened a new pull request, #8152: [HUDI-5740] Refactor Deltastreamer and schema providers to use HoodieConfig/ConfigProperty

2023-03-09 Thread via GitHub


lokeshj1703 opened a new pull request, #8152:
URL: https://github.com/apache/hudi/pull/8152

   ### Change Logs
   
   The configs in the following classes are not implemented using HoodieConfig, which makes it 
impossible to surface them on the Configurations page. We need to refactor the code so that 
each config property is implemented using ConfigProperty in a corresponding new HoodieConfig 
class. Refer to HoodieArchivalConfig for an existing implementation of configs; a sketch of 
the target pattern follows the class list below.
   
   InitialCheckPointProvider
   HoodieDeltaStreamer
   HoodieMultiTableDeltaStreamer
   FilebasedSchemaProvider
   HiveSchemaProvider
   JdbcbasedSchemaProvider
   ProtoClassBasedSchemaProvider
   SchemaPostProcessor
   SchemaRegistryProvider
   SparkAvroPostProcessor
   DropColumnSchemaPostProcessor
   BaseSchemaPostProcessorConfig
   
   KafkaOffsetPostProcessor
   
   SanitizationUtils
   
   Also 'hoodie.deltastreamer.multiwriter.source.checkpoint.id' in 
HoodieWriteConfig
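   
   As a rough illustration of the target pattern (not part of this PR; the class name below is 
invented and the key is only an example), a config declared with ConfigProperty inside a 
HoodieConfig subclass, in the style of HoodieArchivalConfig, looks roughly like this:
   
   ```java
   // Hypothetical sketch: declaring a schema-provider config as a ConfigProperty
   // so it can be surfaced on the generated Configurations page.
   import org.apache.hudi.common.config.ConfigProperty;
   import org.apache.hudi.common.config.HoodieConfig;
   
   public class HoodieSchemaProviderConfigSketch extends HoodieConfig {
   
     public static final ConfigProperty<String> SOURCE_SCHEMA_FILE = ConfigProperty
         .key("hoodie.deltastreamer.schemaprovider.source.schema.file")
         .noDefaultValue()
         .withDocumentation("Path of the file that holds the source Avro schema.");
   }
   ```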
   
   ### Impact
   
   NA
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8150: [HUDI-5917] Fix HoodieRetryWrapperFileSystem getDefaultReplication

2023-03-09 Thread via GitHub


hudi-bot commented on PR #8150:
URL: https://github.com/apache/hudi/pull/8150#issuecomment-1463383059

   
   ## CI report:
   
   * b822947584be483fcc23fd1880d2212f31ae386d UNKNOWN
   * 6dc5a2866114879b660baceae026bf8574126af3 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15652)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8149: [HUDI-5915] Fixed load ckpMeatadata error when using minio

2023-03-09 Thread via GitHub


hudi-bot commented on PR #8149:
URL: https://github.com/apache/hudi/pull/8149#issuecomment-1463383019

   
   ## CI report:
   
   * b04749aba0c507eb67fd6dd756e21ed7f1e3535e Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15650)
 
   * 64fff59128deb511ed29c4ac7972345e6dab1bd7 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15653)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8150: [HUDI-5917] Fix HoodieRetryWrapperFileSystem getDefaultReplication

2023-03-09 Thread via GitHub


hudi-bot commented on PR #8150:
URL: https://github.com/apache/hudi/pull/8150#issuecomment-1463376825

   
   ## CI report:
   
   * b822947584be483fcc23fd1880d2212f31ae386d UNKNOWN
   * 6dc5a2866114879b660baceae026bf8574126af3 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8149: [HUDI-5915] Fixed load ckpMeatadata error when using minio

2023-03-09 Thread via GitHub


hudi-bot commented on PR #8149:
URL: https://github.com/apache/hudi/pull/8149#issuecomment-1463376789

   
   ## CI report:
   
   * b04749aba0c507eb67fd6dd756e21ed7f1e3535e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15650)
 
   * 64fff59128deb511ed29c4ac7972345e6dab1bd7 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8133: [HUDI-5904] support more than one update actions in merge into table

2023-03-09 Thread via GitHub


hudi-bot commented on PR #8133:
URL: https://github.com/apache/hudi/pull/8133#issuecomment-1463376698

   
   ## CI report:
   
   * 8e3fad5fa9e9c64e7e345a317865f6fe6a9a7620 UNKNOWN
   * a690c5122694914f975ebbb717e06630ac3b5902 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15646)
 
   * a9f08395c3578b1567ec34ed61fb34acc219aa28 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] MrAladdin opened a new issue, #8151: [SUPPORT]org.apache.hudi.exception.HoodieCompactionException: Could not compact /.hoodie/metadata

2023-03-09 Thread via GitHub


MrAladdin opened a new issue, #8151:
URL: https://github.com/apache/hudi/issues/8151

   **Describe the problem you faced**
   Hudi metadata table: compaction exception.
   
   Verified against:
   hudi 0.12.2: ok
   hudi 0.13.0: compaction exception
   
   **Expected behavior**
   
   org.apache.hudi.exception.HoodieCompactionException: Could not compact 
/.hoodie/metadata
   
   **Environment Description**
   
   * Hudi version :0.13.0
   
   * Spark version :3.3.1
   
   * Hive version :3.1.2
   
   * Hadoop version :3.1.3
   
   * Storage (HDFS/S3/GCS..) :HDFS
   
   * Running on Docker? (yes/no) :no
   
   
   **Additional context**
   .option(DataSourceWriteOptions.OPERATION.key(), 
DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
   .option(DataSourceWriteOptions.TABLE_TYPE.key(), 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL)
   .option("hoodie.index.type", "BUCKET")
   .option("hoodie.index.bucket.engine", "CONSISTENT_HASHING")
   
   **Stacktrace**
   Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2672)
at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2608)
at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2607)
at 
scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at 
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2607)
at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1182)
at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1182)
at scala.Option.foreach(Option.scala:407)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1182)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2860)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2802)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2791)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:952)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2228)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2249)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2268)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2293)
at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1021)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:406)
at org.apache.spark.rdd.RDD.collect(RDD.scala:1020)
at org.apache.spark.api.java.JavaRDDLike.collect(JavaRDDLike.scala:362)
at org.apache.spark.api.java.JavaRDDLike.collect$(JavaRDDLike.scala:361)
at 
org.apache.spark.api.java.AbstractJavaRDDLike.collect(JavaRDDLike.scala:45)
at 
org.apache.hudi.data.HoodieJavaRDD.collectAsList(HoodieJavaRDD.java:163)
at 
org.apache.hudi.table.action.compact.RunCompactionActionExecutor.execute(RunCompactionActionExecutor.java:101)
... 66 more
   Caused by: org.apache.hudi.exception.HoodieException: Exception when reading 
log file 
at 
org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternalV1(AbstractHoodieLogRecordReader.java:376)
at 
org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternal(AbstractHoodieLogRecordReader.java:223)
at 
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.performScan(HoodieMergedLogRecordScanner.java:198)
at 
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.(HoodieMergedLogRecordScanner.java:114)
at 
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.(HoodieMergedLogRecordScanner.java:73)
at 
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner$Builder.build(HoodieMergedLogRecordScanner.java:464)
at 
org.apache.hudi.table.action.compact.HoodieCompactor.compact(HoodieCompactor.java:204)
at 
org.apache.hudi.table.action.compact.HoodieCompactor.lambda$compact$9cd4b1be$1(HoodieCompactor.java:129)
at 
org.apache.spark.api.java.JavaPairRDD$.$anonfun$toScalaFunction$1(JavaPairRDD.scala:1070)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)

[GitHub] [hudi] hudi-bot commented on pull request #8133: [HUDI-5904] support more than one update actions in merge into table

2023-03-09 Thread via GitHub


hudi-bot commented on PR #8133:
URL: https://github.com/apache/hudi/pull/8133#issuecomment-1463368031

   
   ## CI report:
   
   * 8e3fad5fa9e9c64e7e345a317865f6fe6a9a7620 UNKNOWN
   * a690c5122694914f975ebbb717e06630ac3b5902 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15646)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8150: [HUDI-5917] Fix HoodieRetryWrapperFileSystem getDefaultReplication

2023-03-09 Thread via GitHub


hudi-bot commented on PR #8150:
URL: https://github.com/apache/hudi/pull/8150#issuecomment-1463368135

   
   ## CI report:
   
   * b822947584be483fcc23fd1880d2212f31ae386d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on issue #8071: [SUPPORT]How to improve the speed of Flink writing to hudi ?

2023-03-09 Thread via GitHub


danny0405 commented on issue #8071:
URL: https://github.com/apache/hudi/issues/8071#issuecomment-1463363633

   Thanks. For a COW table with the insert operation, Flink does not use any index, so the 
bucket index does not take effect and the write throughput should be high. For UPSERTs with 
the bucket index, if you use COW, yes, the performance is bad because almost the whole 
table/partition is rewritten on each checkpoint.
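   
   For readers of this thread, a hedged sketch of the two setups being compared (the option 
keys follow the Hudi Flink connector as commonly documented and should be double-checked for 
your version; paths and table names are made up):
   
   ```java
   import org.apache.flink.table.api.EnvironmentSettings;
   import org.apache.flink.table.api.TableEnvironment;
   
   public class HudiFlinkWriteModesSketch {
     public static void main(String[] args) {
       TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());
   
       // INSERT into a COPY_ON_WRITE table: no index lookup, high write throughput.
       tEnv.executeSql(
           "CREATE TABLE hudi_insert_cow (id INT, name STRING, ts TIMESTAMP(3)) WITH ("
               + " 'connector' = 'hudi', 'path' = 'hdfs:///tmp/hudi_insert_cow',"
               + " 'table.type' = 'COPY_ON_WRITE', 'write.operation' = 'insert')");
   
       // UPSERT with the bucket index: prefer MERGE_ON_READ so a checkpoint appends
       // log files instead of rewriting whole file groups of the partition.
       tEnv.executeSql(
           "CREATE TABLE hudi_upsert_mor (id INT, name STRING, ts TIMESTAMP(3),"
               + " PRIMARY KEY (id) NOT ENFORCED) WITH ("
               + " 'connector' = 'hudi', 'path' = 'hdfs:///tmp/hudi_upsert_mor',"
               + " 'table.type' = 'MERGE_ON_READ', 'write.operation' = 'upsert',"
               + " 'index.type' = 'BUCKET')");
     }
   }
   ```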


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #8088: [HUDI-5873] The pending compactions of dataset table should not block…

2023-03-09 Thread via GitHub


danny0405 commented on code in PR #8088:
URL: https://github.com/apache/hudi/pull/8088#discussion_r1132010791


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -1029,23 +1029,63 @@ protected HoodieData prepRecords(Map
+   * Cases to be handled:
+   * 
+   *   We cannot perform compaction if there are previous inflight 
operations on the dataset. This is because
+   *   a compacted metadata base file at time Tx should represent all the 
actions on the dataset till time Tx;
+   *   In multi-writer scenario, a parallel operation with a greater 
instantTime may have completed creating a
+   *   deltacommit.
+   * 
*/
   protected void compactIfNecessary(BaseHoodieWriteClient writeClient, String 
instantTime) {
 // finish off any pending compactions if any from previous attempt.
 writeClient.runAnyPendingCompactions();
 
-String latestDeltaCommitTimeInMetadataTable = 
metadataMetaClient.reloadActiveTimeline()
+HoodieTimeline metadataCompletedDeltaCommitTimeline = 
metadataMetaClient.reloadActiveTimeline()
 .getDeltaCommitTimeline()
-.filterCompletedInstants()
+.filterCompletedInstants();
+String latestDeltaCommitTimeInMetadataTable = 
metadataCompletedDeltaCommitTimeline
 .lastInstant().orElseThrow(() -> new HoodieMetadataException("No 
completed deltacommit in metadata table"))
 .getTimestamp();
-List pendingInstants = 
dataMetaClient.reloadActiveTimeline().filterInflightsAndRequested()
+Set metadataCompletedDeltaCommits = 
metadataCompletedDeltaCommitTimeline.getInstantsAsStream()
+.map(HoodieInstant::getTimestamp)
+.collect(Collectors.toSet());
+// pending compactions in DT should not block the compaction of MDT.
+// a pending compaction on the DT(for MOR table, this is a common case)
+// could cause the MDT compaction not been triggered in time,
+// the slow compaction progress of MDT can further affect the timeline 
archiving of DT,
+// which would result in both timelines from DT and MDT can not be 
archived timely,
+// that is how the small file issues from both the DT and MDT timelines 
emerge.
+
+// why we could filter out the compaction commit that has not been 
committed into the MDT?
+
+// there are 2 preconditions that need to address first:
+// 1. only the write commits (commit, delta_commit, replace_commit) can 
trigger the MDT compaction;
+// 2. the MDT is always committed before the DT.
+
+// there are 3 cases we want to analyze for a compaction instant from DT:
+// 1. both the DT and MDT does not commit the instant;
+//1.1 the compaction in DT is normal, it just lags long time to finish;
+//1.2 some error happens to the compaction procedure.
+// 2. the MDT committed the compaction instant, while the DT hadn't;
+//2.1 the job crashed suddenly while the compactor tries to commit to 
the DT right after the MDT has been committed;
+//2.2 the job has been canceled manually right after the MDT has been 
committed.
+// 3. both the DT and MDT commit the instant.
+
+// the 3rd case should be okay, now let's analyze the first 2 cases:
+//
+// the 1st case: if the instant has not been committed yet, the compaction 
of MDT would just ignore the instant,
+// so the pending instant can not be compacted into the HFile, the instant 
should also not be archived by both of the DT and the MDT(that is how the 
archival mechanism works),
+// the log reader of MDT would ignore the instant correctly, the result 
view should work!
+
+// the 2nd case: we can not trigger compact, because once the MDT 
triggers, the MDT archiver can then archive the instant, but this instant has 
not been committed in the DT,
+// the MDT reader can not filter out the instant correctly, another reason 
is once the instant is compacted into HFile, the subsequent rollback from DT 
may try to look up
+// the files to be rolled back, an exception could throw(although the 
default behavior is not to throws).
+

Review Comment:
   Let me explain the procedure a little more with a demo:
   
   ```java
   delta_c1 (F3, F4) (MDT)
   delta_c1 (F1, F2) (DT)
   
   c2.inflight (compaction triggers in DT)
   
   delta_c3 (F7, F8) (MDT)
   delta_c3 (F5, F6) (DT)
   
   c2 (F7, F8) (compaction complete in MDT)
   c2 fails to commit to DT
   
   delta_c4 (F9, F10) (MDT)
   -- can we trigger MDT compaction here? The answer is yes
   1. c2 in DT would block the archiving of C2 in MDT
   2. the MDT reader would ignore the C2 too because it is filtered by the c2 
on DT timeline, so the compaction does not include c2
   delta_c4 (F11, F12) (DT)
   
   r5 (to rollback c2) (MDT)
   -F7, -F8
   r5 (to rollback c2) (DT)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

[jira] [Closed] (HUDI-5851) Refactor ExpressionEvaluators to split into 2 phase: evaluator conversion and evaluator execution

2023-03-09 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-5851.

Fix Version/s: 0.14.0
   Resolution: Fixed

Fixed via master branch: 79428391bac7277ffa9e18c75594a6fb9b8c5665

> Refactor ExpressionEvaluators to split into 2 phase: evaluator conversion and 
> evaluator execution
> -
>
> Key: HUDI-5851
> URL: https://issues.apache.org/jira/browse/HUDI-5851
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: flink, flink-sql
>Reporter: Jing Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] xuzifu666 commented on a diff in pull request #8133: [HUDI-5904] support more than one update actions in merge into table

2023-03-09 Thread via GitHub


xuzifu666 commented on code in PR #8133:
URL: https://github.com/apache/hudi/pull/8133#discussion_r1132008580


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestMergeIntoTable.scala:
##
@@ -115,6 +116,65 @@ class TestMergeIntoTable extends HoodieSparkSqlTestBase 
with ScalaAssertionSuppo
 })
   }
 
+  test("Test MergeInto with more than once update actions") {
+withRecordType()(withTempDir {tmp =>
+  val conf = new 
SparkConf().setAppName("insertDatasToHudi").setMaster("local[*]")
+  val spark = SparkSession.builder().config(conf)
+.config("spark.serializer", 
"org.apache.spark.serializer.KryoSerializer")

Review Comment:
   OK, thanks. I added a TODO issue: https://issues.apache.org/jira/browse/HUDI-5918 @XuQianJin-Stars 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-5918) merge into with multiple update actions without preCombine key

2023-03-09 Thread xy (Jira)
xy created HUDI-5918:


 Summary: merge into with multiple update actions without preCombine key
 Key: HUDI-5918
 URL: https://issues.apache.org/jira/browse/HUDI-5918
 Project: Apache Hudi
  Issue Type: Bug
  Components: spark-sql
Reporter: xy


merge into with multiple update actions without preCombine key
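
A hypothetical example of the statement shape this ticket is about (table, column, and app
names are invented; the Hudi Spark SQL extensions are assumed to be configured): a MERGE INTO
whose matched branch carries more than one update action, issued against a Hudi table that
defines no preCombine field.

```java
import org.apache.spark.sql.SparkSession;

public class MergeIntoMultiUpdateSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("merge-into-multi-update")
        .master("local[*]")
        .getOrCreate();

    // Two conditional update actions in the matched branch; the reported problem
    // concerns this shape when the target Hudi table has no preCombine key.
    spark.sql(
        "MERGE INTO target_tbl t USING source_tbl s ON t.id = s.id "
            + "WHEN MATCHED AND s.flag = 'a' THEN UPDATE SET t.name = s.name "
            + "WHEN MATCHED AND s.flag = 'b' THEN UPDATE SET t.price = s.price "
            + "WHEN NOT MATCHED THEN INSERT *");
  }
}
```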



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[hudi] branch master updated: [HUDI-5851] Improvement of data skipping, only converts expressions to evaluators once (#8051)

2023-03-09 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 79428391bac [HUDI-5851] Improvement of data skipping, only converts 
expressions to evaluators once (#8051)
79428391bac is described below

commit 79428391bac7277ffa9e18c75594a6fb9b8c5665
Author: Jing Zhang 
AuthorDate: Fri Mar 10 14:53:16 2023 +0800

[HUDI-5851] Improvement of data skipping, only converts expressions to 
evaluators once (#8051)

* Add log to FileIndex about the data skipping info
* Move all evaluators and relative utility in one class
---
 .../java/org/apache/hudi/source/DataPruner.java| 140 +
 .../apache/hudi/source/ExpressionEvaluators.java   | 576 
 .../java/org/apache/hudi/source/FileIndex.java |  46 +-
 .../org/apache/hudi/source/stats/ColumnStats.java  |  72 +++
 .../hudi/source/stats/ExpressionEvaluator.java | 605 -
 .../hudi/source/TestExpressionEvaluators.java  | 408 ++
 .../hudi/source/stats/TestExpressionEvaluator.java | 403 --
 .../apache/hudi/table/ITTestHoodieDataSource.java  |   7 +
 8 files changed, 1230 insertions(+), 1027 deletions(-)

diff --git 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/DataPruner.java
 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/DataPruner.java
new file mode 100644
index 000..605fcdf7fb0
--- /dev/null
+++ 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/DataPruner.java
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.source;
+
+import org.apache.hudi.source.stats.ColumnStats;
+import org.apache.hudi.util.ExpressionUtils;
+
+import org.apache.flink.table.data.RowData;
+import org.apache.flink.table.expressions.ResolvedExpression;
+import org.apache.flink.table.types.logical.DecimalType;
+import org.apache.flink.table.types.logical.LogicalType;
+import org.apache.flink.table.types.logical.RowType;
+import org.apache.flink.table.types.logical.TimestampType;
+
+import java.io.Serializable;
+import java.util.LinkedHashMap;
+import java.util.List;
+import java.util.Map;
+
+import static org.apache.hudi.source.ExpressionEvaluators.fromExpression;
+
+/**
+ * Utility to do data skipping.
+ */
+public class DataPruner implements Serializable {
+  private static final long serialVersionUID = 1L;
+
+  private final String[] referencedCols;
+  private final List<ExpressionEvaluators.Evaluator> evaluators;
+
+  private DataPruner(String[] referencedCols, List<ExpressionEvaluators.Evaluator> evaluators) {
+this.referencedCols = referencedCols;
+this.evaluators = evaluators;
+  }
+
+  /**
+   * Filters the index row with specific data filters and query fields.
+   *
+   * @param indexRowThe index row
+   * @param queryFields The query fields referenced by the filters
+   * @return true if the index row should be considered as a candidate
+   */
+  public boolean test(RowData indexRow, RowType.RowField[] queryFields) {
+Map<String, ColumnStats> columnStatsMap = convertColumnStats(indexRow, queryFields);
+for (ExpressionEvaluators.Evaluator evaluator : evaluators) {
+  if (!evaluator.eval(columnStatsMap)) {
+return false;
+  }
+}
+return true;
+  }
+
+  public String[] getReferencedCols() {
+return referencedCols;
+  }
+
+  public static DataPruner newInstance(List<ResolvedExpression> filters) {
+if (filters == null || filters.size() == 0) {
+  return null;
+}
+String[] referencedCols = ExpressionUtils.referencedColumns(filters);
+if (referencedCols.length == 0) {
+  return null;
+}
+List<ExpressionEvaluators.Evaluator> evaluators = fromExpression(filters);
+return new DataPruner(referencedCols, evaluators);
+  }
+
+  public static Map<String, ColumnStats> convertColumnStats(RowData indexRow, RowType.RowField[] queryFields) {
+if (indexRow == null || queryFields == null) {
+  throw new IllegalArgumentException("Index Row and query fields could not 
be null.");
+}
+Map<String, ColumnStats> mapping = new LinkedHashMap<>();
+for (int i = 0; i < queryFields.length; i++) {
+  String name = queryFields[i].ge

[GitHub] [hudi] danny0405 merged pull request #8051: [HUDI-5851] Improvement of data skipping, only converts expressions to evaluators once

2023-03-09 Thread via GitHub


danny0405 merged PR #8051:
URL: https://github.com/apache/hudi/pull/8051


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on issue #8147: [SUPPORT] Missing dependency on hive-exec (core)

2023-03-09 Thread via GitHub


danny0405 commented on issue #8147:
URL: https://github.com/apache/hudi/issues/8147#issuecomment-1463348786

   > WriteProfiles.getCommitMetadata
   
   I see, in this PR https://github.com/apache/hudi/pull/7055, I have moved the utility method 
into another class located in `hudi-common`, so this should not be a problem anymore.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on issue #8147: [SUPPORT] Missing dependency on hive-exec (core)

2023-03-09 Thread via GitHub


danny0405 commented on issue #8147:
URL: https://github.com/apache/hudi/issues/8147#issuecomment-1463347507

   You are right, the bundle jar should be kept slimmer.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 closed issue #8136: [SUPPORT] Wrong type returned by ParquetColumnarRowSplitReader in hudi-flink1.16.x code

2023-03-09 Thread via GitHub


danny0405 closed issue #8136: [SUPPORT] Wrong type returned by 
ParquetColumnarRowSplitReader in hudi-flink1.16.x code
URL: https://github.com/apache/hudi/issues/8136


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-5917) MOR Table Log file has only one replication

2023-03-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-5917:
-
Labels: pull-request-available  (was: )

> MOR Table Log file has only one replication
> ---
>
> Key: HUDI-5917
> URL: https://issues.apache.org/jira/browse/HUDI-5917
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: sandy du
>Priority: Major
>  Labels: pull-request-available
>
> When a MOR table enables HoodieRetryWrapperFileSystem through the configuration
> `hoodie.filesystem.operation.retry.enable=true`, the log file in HDFS has only
> one replication.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] sandyfog opened a new pull request, #8150: [HUDI-5917] Fix HoodieRetryWrapperFileSystem getDefaultReplication

2023-03-09 Thread via GitHub


sandyfog opened a new pull request, #8150:
URL: https://github.com/apache/hudi/pull/8150

   ### Change Logs
   
   When a MOR table enables HoodieRetryWrapperFileSystem through the configuration 
`hoodie.filesystem.operation.retry.enable=true`, the log file in HDFS has only one replication.
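   
   A minimal sketch of the idea behind the fix (not the actual patch; `MiniFileSystem` below is 
a stand-in for `org.apache.hadoop.fs.FileSystem` to keep the example self-contained): the retry 
wrapper has to forward `getDefaultReplication` to the file system it wraps instead of falling 
back to a default of 1.
   
   ```java
   interface MiniFileSystem {
     short getDefaultReplication(String path);
   }
   
   final class RetryWrapperSketch implements MiniFileSystem {
     private final MiniFileSystem wrapped;
   
     RetryWrapperSketch(MiniFileSystem wrapped) {
       this.wrapped = wrapped;
     }
   
     @Override
     public short getDefaultReplication(String path) {
       // Delegate so the replication factor configured on the underlying HDFS
       // (typically 3) is preserved for newly created log files.
       return wrapped.getDefaultReplication(path);
     }
   }
   ```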
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-5917) MOR Table Log file has only one replication

2023-03-09 Thread sandy du (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandy du updated HUDI-5917:
---
Description: When mor talbe enable  HoodieRetryWrapperFileSystem through 
the configuration  `hoodie.filesystem.operation.retry.enable=true` ,log file in 
hdfs only has one replication.  (was: When mor talbe enable  
HoodieRetryWrapperFileSystem through the configuration  
`hoodie.filesystem.operation.retry.enable=true` ,log file in hdfs only has 1 
replication.)
Summary: MOR Table Log file has only one replication  (was: MOR Table 
Log file have only one replication)

> MOR Table Log file has only one replication
> ---
>
> Key: HUDI-5917
> URL: https://issues.apache.org/jira/browse/HUDI-5917
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: sandy du
>Priority: Major
>
> When a MOR table enables HoodieRetryWrapperFileSystem through the configuration
> `hoodie.filesystem.operation.retry.enable=true`, the log file in HDFS has only
> one replication.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5917) MOR Table Log file have only one replication

2023-03-09 Thread sandy du (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandy du updated HUDI-5917:
---
Summary: MOR Table Log file have only one replication  (was: MOR )

> MOR Table Log file have only one replication
> 
>
> Key: HUDI-5917
> URL: https://issues.apache.org/jira/browse/HUDI-5917
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: sandy du
>Priority: Major
>
> When a MOR table enables HoodieRetryWrapperFileSystem through the configuration
> `hoodie.filesystem.operation.retry.enable=true`, the log file in HDFS has only
> 1 replication.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-5917) MOR

2023-03-09 Thread sandy du (Jira)
sandy du created HUDI-5917:
--

 Summary: MOR 
 Key: HUDI-5917
 URL: https://issues.apache.org/jira/browse/HUDI-5917
 Project: Apache Hudi
  Issue Type: Bug
Reporter: sandy du


When a MOR table enables HoodieRetryWrapperFileSystem through the configuration 
`hoodie.filesystem.operation.retry.enable=true`, the log file in HDFS has only 1 
replication.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #7956: [HUDI-5797] fix use bulk insert error as row

2023-03-09 Thread via GitHub


hudi-bot commented on PR #7956:
URL: https://github.com/apache/hudi/pull/7956#issuecomment-1463323664

   
   ## CI report:
   
   * 6dc701ed6011cb5983de68e88b9a67522d1e8db3 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15645)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] XuQianJin-Stars commented on a diff in pull request #8133: [HUDI-5904] support more than one update actions in merge into table

2023-03-09 Thread via GitHub


XuQianJin-Stars commented on code in PR #8133:
URL: https://github.com/apache/hudi/pull/8133#discussion_r1131979301


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestMergeIntoTable.scala:
##
@@ -115,6 +116,65 @@ class TestMergeIntoTable extends HoodieSparkSqlTestBase 
with ScalaAssertionSuppo
 })
   }
 
+  test("Test MergeInto with more than once update actions") {
+withRecordType()(withTempDir {tmp =>
+  val conf = new 
SparkConf().setAppName("insertDatasToHudi").setMaster("local[*]")
+  val spark = SparkSession.builder().config(conf)
+.config("spark.serializer", 
"org.apache.spark.serializer.KryoSerializer")

Review Comment:
   Both `conf` and `spark` can be removed; both are already available in the `HoodieSparkSqlTestBase` 
class.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] kkrugler commented on issue #8147: [SUPPORT] Missing dependency on hive-exec (core)

2023-03-09 Thread via GitHub


kkrugler commented on issue #8147:
URL: https://github.com/apache/hudi/issues/8147#issuecomment-1463256084

   The `hudi-flink-bundle` pom has what seems like a very long list of 
transitive dependencies (from running `mvn dependency:tree` in the 
`packaging/flink-hudi-bundle/` directory). I'm wondering why you don't think 
this would pull in jars that create conflicts with other jars being used in a 
workflow...
   
   ```
   [INFO] org.apache.hudi:hudi-flink1.16-bundle:jar:0.14.0-SNAPSHOT
   [INFO] +- org.apache.hudi:hudi-common:jar:0.14.0-SNAPSHOT:compile
   [INFO] |  +- org.openjdk.jol:jol-core:jar:0.16:compile
   [INFO] |  +- com.github.ben-manes.caffeine:caffeine:jar:2.9.1:compile
   [INFO] |  |  +- org.checkerframework:checker-qual:jar:3.10.0:compile
   [INFO] |  |  \- 
com.google.errorprone:error_prone_annotations:jar:2.5.1:compile
   [INFO] |  +- org.apache.httpcomponents:fluent-hc:jar:4.4.1:compile
   [INFO] |  |  \- commons-logging:commons-logging:jar:1.2:compile
   [INFO] |  +- org.apache.httpcomponents:httpclient:jar:4.4.1:compile
   [INFO] |  +- org.apache.hbase:hbase-client:jar:2.4.9:compile
   [INFO] |  |  +- 
org.apache.hbase.thirdparty:hbase-shaded-protobuf:jar:3.5.1:compile
   [INFO] |  |  +- org.apache.hbase:hbase-common:jar:2.4.9:compile
   [INFO] |  |  |  +- org.apache.hbase:hbase-logging:jar:2.4.9:compile
   [INFO] |  |  |  \- 
org.apache.hbase.thirdparty:hbase-shaded-gson:jar:3.5.1:compile
   [INFO] |  |  +- org.apache.hbase:hbase-hadoop-compat:jar:2.4.9:compile
   [INFO] |  |  +- org.apache.hbase:hbase-hadoop2-compat:jar:2.4.9:compile
   [INFO] |  |  |  \- javax.activation:javax.activation-api:jar:1.2.0:runtime
   [INFO] |  |  +- org.apache.hbase:hbase-protocol-shaded:jar:2.4.9:compile
   [INFO] |  |  +- org.apache.hbase:hbase-protocol:jar:2.4.9:compile
   [INFO] |  |  +- 
org.apache.hbase.thirdparty:hbase-shaded-miscellaneous:jar:3.5.1:compile
   [INFO] |  |  +- 
org.apache.hbase.thirdparty:hbase-shaded-netty:jar:3.5.1:compile
   [INFO] |  |  +- org.apache.htrace:htrace-core4:jar:4.2.0-incubating:compile
   [INFO] |  |  +- org.jruby.jcodings:jcodings:jar:1.0.55:compile
   [INFO] |  |  +- org.jruby.joni:joni:jar:2.1.31:compile
   [INFO] |  |  +- org.apache.commons:commons-crypto:jar:1.0.0:compile
   [INFO] |  |  \- org.apache.hadoop:hadoop-auth:jar:2.10.1:provided
   [INFO] |  | +- com.nimbusds:nimbus-jose-jwt:jar:7.9:provided
   [INFO] |  | |  \- 
com.github.stephenc.jcip:jcip-annotations:jar:1.0-1:provided
   [INFO] |  | \- 
org.apache.directory.server:apacheds-kerberos-codec:jar:2.0.0-M15:provided
   [INFO] |  |+- 
org.apache.directory.server:apacheds-i18n:jar:2.0.0-M15:provided
   [INFO] |  |+- 
org.apache.directory.api:api-asn1-api:jar:1.0.0-M20:provided
   [INFO] |  |\- 
org.apache.directory.api:api-util:jar:1.0.0-M20:provided
   [INFO] |  +- org.apache.hbase:hbase-server:jar:2.4.9:compile
   [INFO] |  |  +- org.apache.hbase:hbase-http:jar:2.4.9:compile
   [INFO] |  |  |  +- 
org.apache.hbase.thirdparty:hbase-shaded-jetty:jar:3.5.1:compile
   [INFO] |  |  |  +- 
org.apache.hbase.thirdparty:hbase-shaded-jersey:jar:3.5.1:compile
   [INFO] |  |  |  |  +- jakarta.ws.rs:jakarta.ws.rs-api:jar:2.1.6:compile
   [INFO] |  |  |  |  +- 
jakarta.annotation:jakarta.annotation-api:jar:1.3.5:compile
   [INFO] |  |  |  |  +- 
jakarta.validation:jakarta.validation-api:jar:2.0.2:compile
   [INFO] |  |  |  |  \- 
org.glassfish.hk2.external:jakarta.inject:jar:2.6.1:compile
   [INFO] |  |  |  \- javax.ws.rs:javax.ws.rs-api:jar:2.1.1:compile
   [INFO] |  |  +- org.apache.hbase:hbase-procedure:jar:2.4.9:compile
   [INFO] |  |  +- org.apache.hbase:hbase-zookeeper:jar:2.4.9:compile
   [INFO] |  |  +- org.apache.hbase:hbase-replication:jar:2.4.9:compile
   [INFO] |  |  +- org.apache.hbase:hbase-metrics-api:jar:2.4.9:compile
   [INFO] |  |  +- org.apache.hbase:hbase-metrics:jar:2.4.9:compile
   [INFO] |  |  +- org.apache.hbase:hbase-asyncfs:jar:2.4.9:compile
   [INFO] |  |  +- org.glassfish.web:javax.servlet.jsp:jar:2.3.2:compile
   [INFO] |  |  |  \- org.glassfish:javax.el:jar:3.0.1-b12:provided
   [INFO] |  |  +- javax.servlet.jsp:javax.servlet.jsp-api:jar:2.3.1:compile
   [INFO] |  |  +- org.apache.commons:commons-math3:jar:3.6.1:compile
   [INFO] |  |  +- org.apache.hadoop:hadoop-distcp:jar:2.10.0:compile
   [INFO] |  |  \- org.apache.hadoop:hadoop-annotations:jar:2.10.0:compile
   [INFO] |  +- commons-io:commons-io:jar:2.11.0:compile
   [INFO] |  +- org.lz4:lz4-java:jar:1.8.0:compile
   [INFO] |  \- com.lmax:disruptor:jar:3.4.2:compile
   [INFO] +- org.apache.hudi:hudi-client-common:jar:0.14.0-SNAPSHOT:compile
   [INFO] |  +- com.github.davidmoten:hilbert-curve:jar:0.2.2:compile
   [INFO] |  |  \- com.github.davidmoten:guava-mini:jar:0.1.3:compile
   [INFO] |  +- io.dropwizard.metrics:metrics-graphite:jar:4.1.1:compile
   [INFO] |  +- io.dropwizard.metrics:metrics-core:jar:4.1.1:compile
   [INFO] |  +- io.dropwizard.metrics:metrics-jmx:jar:4.1.1:compile
   [IN

[GitHub] [hudi] kkrugler commented on issue #8147: [SUPPORT] Missing dependency on hive-exec (core)

2023-03-09 Thread via GitHub


kkrugler commented on issue #8147:
URL: https://github.com/apache/hudi/issues/8147#issuecomment-1463250189

   I was also confused by that. I think when the `HoodieInputFormatUtils` class is loaded via 
the call from `WriteProfiles.getCommitMetadata()` to `HoodieInputFormatUtils.getCommitMetadata()`, 
this indirectly triggers a reference to `MapredParquetInputFormat` (e.g. maybe through a static 
class reference?).
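   
   A tiny, purely hypothetical illustration of that effect (all class names invented): a static 
member whose type lives in an optional jar is resolved when the declaring class is initialized, 
so even a call that never uses it can fail with `NoClassDefFoundError` if the jar is absent.
   
   ```java
   class OptionalFormat { }  // stand-in for a class shipped only in hive-exec
   
   final class UtilsWithStaticReference {
     // Resolved while the class is initialized, before any method body runs.
     static final Class<?> FORMAT_CLASS = OptionalFormat.class;
   
     static String unrelated() {
       return "never touches OptionalFormat directly";
     }
   }
   
   class Demo {
     public static void main(String[] args) {
       // With OptionalFormat missing from the runtime classpath, this call would
       // throw NoClassDefFoundError even though unrelated() does not use it.
       System.out.println(UtilsWithStaticReference.unrelated());
     }
   }
   ```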


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on issue #8147: [SUPPORT] Missing dependency on hive-exec (core)

2023-03-09 Thread via GitHub


danny0405 commented on issue #8147:
URL: https://github.com/apache/hudi/issues/8147#issuecomment-1463239717

   The error stack trace confused me a lot, because `WriteProfiles.getCommitMetadata` does not 
depend on `MapredParquetInputFormat` in its code path, so why does it try to load it?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8149: [HUDI-5915] Fixed load ckpMeatadata error when using minio

2023-03-09 Thread via GitHub


hudi-bot commented on PR #8149:
URL: https://github.com/apache/hudi/pull/8149#issuecomment-1463234399

   
   ## CI report:
   
   * b04749aba0c507eb67fd6dd756e21ed7f1e3535e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15650)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8139: [HUDI-5909] Reuse hive client if possible

2023-03-09 Thread via GitHub


hudi-bot commented on PR #8139:
URL: https://github.com/apache/hudi/pull/8139#issuecomment-1463234376

   
   ## CI report:
   
   * 0bcd6490f856475266dfff3882728aa1392727f1 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15628)
 
   * 075563866d156e36afe34780d5fb132d6da57251 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15649)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8107: [HUDI-5514] Adding auto generation of record keys support to Hudi

2023-03-09 Thread via GitHub


hudi-bot commented on PR #8107:
URL: https://github.com/apache/hudi/pull/8107#issuecomment-1463234314

   
   ## CI report:
   
   * 35aed635391309c3c6c4b3794044bba53b3468ef Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15603)
 
   * 9dfbe3e6135456e7f8c79513270eb5e7e4ed123d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15648)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8149: [HUDI-5915] Fixed load ckpMeatadata error when using minio

2023-03-09 Thread via GitHub


hudi-bot commented on PR #8149:
URL: https://github.com/apache/hudi/pull/8149#issuecomment-1463230874

   
   ## CI report:
   
   * b04749aba0c507eb67fd6dd756e21ed7f1e3535e UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8139: [HUDI-5909] Reuse hive client if possible

2023-03-09 Thread via GitHub


hudi-bot commented on PR #8139:
URL: https://github.com/apache/hudi/pull/8139#issuecomment-1463230840

   
   ## CI report:
   
   * 0bcd6490f856475266dfff3882728aa1392727f1 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15628)
 
   * 075563866d156e36afe34780d5fb132d6da57251 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8107: [HUDI-5514] Adding auto generation of record keys support to Hudi

2023-03-09 Thread via GitHub


hudi-bot commented on PR #8107:
URL: https://github.com/apache/hudi/pull/8107#issuecomment-1463230758

   
   ## CI report:
   
   * 35aed635391309c3c6c4b3794044bba53b3468ef Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15603)
 
   * 9dfbe3e6135456e7f8c79513270eb5e7e4ed123d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8133: [HUDI-5904] support more than one update actions in merge into table

2023-03-09 Thread via GitHub


hudi-bot commented on PR #8133:
URL: https://github.com/apache/hudi/pull/8133#issuecomment-1463226268

   
   ## CI report:
   
   * 8e3fad5fa9e9c64e7e345a317865f6fe6a9a7620 UNKNOWN
   * 0268541001db5b561328bdf9390ee2cb5e92 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15644)
 
   * a690c5122694914f975ebbb717e06630ac3b5902 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15646)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-5915) listStatus error caused by minio storage

2023-03-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-5915:
-
Labels: pull-request-available  (was: )

> listStatus error caused by minio storage
> 
>
> Key: HUDI-5915
> URL: https://issues.apache.org/jira/browse/HUDI-5915
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: linfey.nie
>Assignee: linfey.nie
>Priority: Major
>  Labels: pull-request-available
>
> When the storage is MinIO, an empty folder is treated as non-existent, which 
> causes listStatus to throw an error and breaks the entire program



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] linfey90 opened a new pull request, #8149: [HUDI-5915] Fixed load ckpMeatadata error when using minio

2023-03-09 Thread via GitHub


linfey90 opened a new pull request, #8149:
URL: https://github.com/apache/hudi/pull/8149

   ### Change Logs
   
   When the storage is MinIO, an empty folder is treated as non-existent, which 
causes listStatus to throw an error and breaks the entire program.
   When the table is created, ckp_meta is empty, so a later call to listStatus 
(for example during insert) fails. This PR fixes that.
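   A minimal sketch of the guard, assuming a Hadoop `FileSystem` handle and the ckp_meta path; the method name `safeListStatus` is illustrative only, not the actual code in this PR:
   
   ```java
   import java.io.IOException;
   import org.apache.hadoop.fs.FileStatus;
   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;
   
   public class CkpMetaListing {
     // On object stores such as MinIO a never-written "folder" may simply not exist,
     // so guard the listing instead of letting listStatus fail and abort the job.
     static FileStatus[] safeListStatus(FileSystem fs, Path ckpMetaPath) throws IOException {
       if (!fs.exists(ckpMetaPath)) {
         return new FileStatus[0]; // treat a missing ckp_meta folder as empty
       }
       return fs.listStatus(ckpMetaPath);
     }
   }
   ```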
   
   ### Impact
   
   no
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-5916) flink bundle jar includes the hive-exec core by default

2023-03-09 Thread Danny Chen (Jira)
Danny Chen created HUDI-5916:


 Summary: flink bundle jar includes the hive-exec core by default
 Key: HUDI-5916
 URL: https://issues.apache.org/jira/browse/HUDI-5916
 Project: Apache Hudi
  Issue Type: Improvement
  Components: dependencies
Reporter: Danny Chen
 Fix For: 0.13.1, 0.14.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] danny0405 commented on issue #8147: [SUPPORT] Missing dependency on hive-exec (core)

2023-03-09 Thread via GitHub


danny0405 commented on issue #8147:
URL: https://github.com/apache/hudi/issues/8147#issuecomment-1463211902

   On a cluster you should use the bundle jar instead, and yes, the default 
bundle jar does not package hive-exec, which should be fixed: 
https://issues.apache.org/jira/browse/HUDI-5916
   
   The `hudi-flink` pom already declares the `hive-exec` dependency: 
https://github.com/apache/hudi/blob/2675118d95c7a087cd9222a05cd7376eb0a31aad/hudi-flink-datasource/hudi-flink/pom.xml#L287,
 but it is not packaged into the released jar. That is by design: we only 
introduce the Hive jars into the bundle jars.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Assigned] (HUDI-5915) listStatus error caused by minio storage

2023-03-09 Thread linfey.nie (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

linfey.nie reassigned HUDI-5915:


Assignee: linfey.nie

> listStatus error caused by minio storage
> 
>
> Key: HUDI-5915
> URL: https://issues.apache.org/jira/browse/HUDI-5915
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: linfey.nie
>Assignee: linfey.nie
>Priority: Major
>
> When the storage is MinIO, an empty folder is treated as non-existent, which 
> causes listStatus to throw an error and breaks the entire program



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-5915) listStatus error caused by minio storage

2023-03-09 Thread linfey.nie (Jira)
linfey.nie created HUDI-5915:


 Summary: listStatus error caused by minio storage
 Key: HUDI-5915
 URL: https://issues.apache.org/jira/browse/HUDI-5915
 Project: Apache Hudi
  Issue Type: Bug
Reporter: linfey.nie


When the storage is MinIO, an empty folder is treated as non-existent, which 
causes listStatus to throw an error and breaks the entire program



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-5914) Fix for RowData class cast exception

2023-03-09 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-5914.

Resolution: Fixed

Fixed via master branch: 2675118d95c7a087cd9222a05cd7376eb0a31aad

> Fix for RowData class cast exception
> 
>
> Key: HUDI-5914
> URL: https://issues.apache.org/jira/browse/HUDI-5914
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Danny Chen
>Priority: Major
> Fix For: 0.13.1, 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #8133: [HUDI-5904] support more than one update actions in merge into table

2023-03-09 Thread via GitHub


hudi-bot commented on PR #8133:
URL: https://github.com/apache/hudi/pull/8133#issuecomment-1463199917

   
   ## CI report:
   
   * 8e3fad5fa9e9c64e7e345a317865f6fe6a9a7620 UNKNOWN
   * 5b8a43f4b2f18352738b6e9c9a183a1bde5c4540 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15639)
 
   * 0268541001db5b561328bdf9390ee2cb5e92 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15644)
 
   * a690c5122694914f975ebbb717e06630ac3b5902 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[hudi] branch master updated (bab75b6c60c -> 2675118d95c)

2023-03-09 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from bab75b6c60c [HUDI-4911] Following the first patch, fix the inefficient 
code (#8127)
 add 2675118d95c [HUDI-5941] Fix for RowData class cast exception (#8145)

No new revisions were added by this update.

Summary of changes:
 .../table/format/cow/vector/reader/ParquetColumnarRowSplitReader.java  | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)



[GitHub] [hudi] danny0405 merged pull request #8145: [HUDI-5941] Fix for RowData class cast exception

2023-03-09 Thread via GitHub


danny0405 merged PR #8145:
URL: https://github.com/apache/hudi/pull/8145


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7956: [HUDI-5797] fix use bulk insert error as row

2023-03-09 Thread via GitHub


hudi-bot commented on PR #7956:
URL: https://github.com/apache/hudi/pull/7956#issuecomment-1463195659

   
   ## CI report:
   
   * 5bd4d5c4de8fc54bf93fb7fd252b6e61fda85373 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15194)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15233)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15247)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15641)
 
   * 6dc701ed6011cb5983de68e88b9a67522d1e8db3 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15645)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on pull request #8145: [HUDI-5941] Fix for RowData class cast exception

2023-03-09 Thread via GitHub


danny0405 commented on PR #8145:
URL: https://github.com/apache/hudi/pull/8145#issuecomment-1463195670

   The test failure: 
https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=15642&view=logs&j=3b6e910d-b98f-5de6-b9cb-1e5ff571f5de&t=30b5aae4-0ea0-5566-42d0-febf71a7061a&l=682866
   
   is not caused by this change, so I will merge it soon.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-5914) Fix for RowData class cast exception

2023-03-09 Thread Danny Chen (Jira)
Danny Chen created HUDI-5914:


 Summary: Fix for RowData class cast exception
 Key: HUDI-5914
 URL: https://issues.apache.org/jira/browse/HUDI-5914
 Project: Apache Hudi
  Issue Type: Bug
  Components: writer-core
Reporter: Danny Chen
 Fix For: 0.13.1, 0.14.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #7956: [HUDI-5797] fix use bulk insert error as row

2023-03-09 Thread via GitHub


hudi-bot commented on PR #7956:
URL: https://github.com/apache/hudi/pull/7956#issuecomment-1463191824

   
   ## CI report:
   
   * 5bd4d5c4de8fc54bf93fb7fd252b6e61fda85373 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15194)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15233)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15247)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15641)
 
   * 6dc701ed6011cb5983de68e88b9a67522d1e8db3 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8133: [HUDI-5904] support more than one update actions in merge into table

2023-03-09 Thread via GitHub


hudi-bot commented on PR #8133:
URL: https://github.com/apache/hudi/pull/8133#issuecomment-1463187972

   
   ## CI report:
   
   * 8e3fad5fa9e9c64e7e345a317865f6fe6a9a7620 UNKNOWN
   * 5b8a43f4b2f18352738b6e9c9a183a1bde5c4540 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15639)
 
   * 0268541001db5b561328bdf9390ee2cb5e92 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15644)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xuzifu666 commented on pull request #8133: [HUDI-5904] support more than one update actions in merge into table

2023-03-09 Thread via GitHub


xuzifu666 commented on PR #8133:
URL: https://github.com/apache/hudi/pull/8133#issuecomment-1463177403

   > Removing only the check will cause data quality problems. If the source 
table has no precombineField, it looks like Hudi will add the first updateAction's 
assignment value expression (whose key is the target precombineField) to the source 
df, because we need to dedup before using the payload. And we need to add more 
tests for COW/MOR:
   > 
   > * different updateAction with diff precombine field expr
   > * source table without precombineField (like target.precombineField = 
source.otherfield)
   >   a simple test like this:
   > 
   > ```sql
   > merge into $cowTableName t0
   > using (
   >   select 1 as id, 'a1_n_6' as name, 6 as price, 1010 as v_ts, '1' as flag 
union
   >   select 2 as id, 'a2_n_6' as name, 6 as price, 1010 as v_ts, '2' as flag 
union
   >   select 6 as id, 'a3_n_6' as name, 6 as price, 1010 as v_ts, '1' as flag
   >   ) s0
   >on s0.id = t0.id
   >when matched and flag = '1' then update set
   >id = s0.id, name = s0.name, ts = 1003
   >when matched and flag = '2' then update set
   >id = s0.id, price = s0.price, ts = s0.v_ts + 2
   >when not matched and flag = '1' then insert *
   > ```
   
   Yes, but most business cases upsert only one record at a time. I think this 
does not impact the business when only a single record is upserted.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] zhuoluoy commented on issue #7417: [SUPPORT] With HoodieROTablePathFilter is too slow load normal parquets in hudi release

2023-03-09 Thread via GitHub


zhuoluoy commented on issue #7417:
URL: https://github.com/apache/hudi/issues/7417#issuecomment-1463139378

   Should we open an Apache JIRA for this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] zhuoluoy commented on issue #7417: [SUPPORT] With HoodieROTablePathFilter is too slow load normal parquets in hudi release

2023-03-09 Thread via GitHub


zhuoluoy commented on issue #7417:
URL: https://github.com/apache/hudi/issues/7417#issuecomment-1463137341

   Actually, for legacy MapReduce this patch is very important. Without this 
patch, HoodieROTablePathFilter will be thousands of times slower.
   
   Can we just bring back https://github.com/apache/hudi/pull/3719 and fix the 
NPE?
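   A minimal sketch of the caching idea from #3719, assuming the expensive file system view depends only on the table base path; the class and the `buildFileSystemView` loader below are illustrative, not Hudi's actual API:
   
   ```java
   import java.util.Map;
   import java.util.concurrent.ConcurrentHashMap;
   import java.util.function.Function;
   
   // Cache one file system view per table base path so a HoodieROTablePathFilter-style
   // lookup can reuse it instead of re-listing the table for every queried path.
   public class FileSystemViewCache<V> {
     private final Map<String, V> cache = new ConcurrentHashMap<>();
     private final Function<String, V> buildFileSystemView; // hypothetical loader
   
     public FileSystemViewCache(Function<String, V> buildFileSystemView) {
       this.buildFileSystemView = buildFileSystemView;
     }
   
     public V getView(String basePath) {
       // build once on first access, then serve from memory
       return cache.computeIfAbsent(basePath, buildFileSystemView);
     }
   }
   ```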


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] zhuoluoy commented on pull request #3719: [HUDI-2489]Tuning HoodieROTablePathFilter by caching hoodieTableFileSystemView, aiming to reduce unnecessary list/get requests

2023-03-09 Thread via GitHub


zhuoluoy commented on PR #3719:
URL: https://github.com/apache/hudi/pull/3719#issuecomment-1463135838

   Actually, for legacy MapReduce this patch is very important. Without this 
patch, HoodieROTablePathFilter will be thousands of times slower.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8133: [HUDI-5904] support more than one update actions in merge into table

2023-03-09 Thread via GitHub


hudi-bot commented on PR #8133:
URL: https://github.com/apache/hudi/pull/8133#issuecomment-1463129575

   
   ## CI report:
   
   * 8e3fad5fa9e9c64e7e345a317865f6fe6a9a7620 UNKNOWN
   * 5b8a43f4b2f18352738b6e9c9a183a1bde5c4540 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15639)
 
   * 0268541001db5b561328bdf9390ee2cb5e92 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-1243) Debug test-suite docker execution

2023-03-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1243:
-
Issue Type: Task  (was: Bug)

> Debug test-suite docker execution
> -
>
> Key: HUDI-1243
> URL: https://issues.apache.org/jira/browse/HUDI-1243
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Testing, tests-ci
>Affects Versions: 0.8.0
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Minor
>
> Debug and fix test-suite docker execution. We should have a smooth run where 
> the end-to-end COW and MOR test suites run without any issues on our local dev 
> box (laptop) 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1243) Debug test-suite docker execution

2023-03-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1243:
-
Fix Version/s: (was: 0.13.1)

> Debug test-suite docker execution
> -
>
> Key: HUDI-1243
> URL: https://issues.apache.org/jira/browse/HUDI-1243
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Testing, tests-ci
>Affects Versions: 0.8.0
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Minor
>
> Debug and fix test-suite docker execution. We should have a smooth run where 
> the end-to-end COW and MOR test suites run without any issues on our local dev 
> box (laptop) 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5824) COMBINE_BEFORE_UPSERT=false option does not work for upsert

2023-03-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5824:
-
Priority: Critical  (was: Minor)

> COMBINE_BEFORE_UPSERT=false option does not work for upsert
> ---
>
> Key: HUDI-5824
> URL: https://issues.apache.org/jira/browse/HUDI-5824
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Affects Versions: 0.12.1, 0.12.2, 0.13.0
>Reporter: kazdy
>Assignee: kazdy
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.1, 0.12.3
>
>
> hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
> shouldCombine does not take into account the situation where the write 
> operation is UPSERT but COMBINE_BEFORE_UPSERT is false.
> Currently, Hudi always combines records on UPSERT, and option 
> COMBINE_BEFORE_UPSERT is not honored.
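
A minimal sketch of the decision the ticket argues for, under stated assumptions: the flag names mirror the Hudi write options, but the helper below is illustrative and not the actual HoodieSparkSqlWriter logic.

{code:java}
// Only deduplicate on UPSERT when COMBINE_BEFORE_UPSERT is true;
// honor COMBINE_BEFORE_INSERT for inserts. Illustrative only.
final class CombineDecision {
  static boolean shouldCombine(String operation, boolean combineBeforeUpsert, boolean combineBeforeInsert) {
    if ("upsert".equalsIgnoreCase(operation)) {
      return combineBeforeUpsert;
    }
    if ("insert".equalsIgnoreCase(operation)) {
      return combineBeforeInsert;
    }
    return false;
  }
}
{code}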



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4733) Flag emitDelete is inconsistent in HoodieTableSource and MergeOnReadInputFormat

2023-03-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4733:
-
Fix Version/s: 0.14.0
   (was: 0.13.1)

> Flag emitDelete is inconsistent in HoodieTableSource and 
> MergeOnReadInputFormat
> ---
>
> Key: HUDI-4733
> URL: https://issues.apache.org/jira/browse/HUDI-4733
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink, flink-sql
>Reporter: nonggia.liang
>Assignee: Zhaojing Yu
>Priority: Minor
> Fix For: 0.14.0
>
> Attachments: image 1.png
>
>
> When reading a MOR table in Flink, we encountered an exception from the Flink 
> runtime (as shown in image 1), which complained that the table source should not 
> emit a retract record.
> !image 1.png!
> I think this is the cause, in HoodieTableSource:
> {code:java}
> @Override
> public ChangelogMode getChangelogMode() {
>   // when read as streaming and changelog mode is enabled, emit as FULL mode;
>   // when all the changes are compacted or read as batch, emit as INSERT mode.
>   return OptionsResolver.emitChangelog(conf) ? ChangelogModes.FULL : 
> ChangelogMode.insertOnly();
> } {code}
> {code:java}
> private InputFormat getStreamInputFormat() { 
> ...
> if (FlinkOptions.QUERY_TYPE_SNAPSHOT.equals(queryType)) { 
>   final HoodieTableType tableType = 
> HoodieTableType.valueOf(this.conf.getString(FlinkOptions.TABLE_TYPE)); 
>   boolean emitDelete = tableType == HoodieTableType.MERGE_ON_READ; 
>   return mergeOnReadInputFormat(rowType, requiredRowType, tableAvroSchema, 
> rowDataType, Collections.emptyList(), emitDelete); }
> ...
>  }
> {code}
> With these options:
> {{'table.type' = 'MERGE_ON_READ'}}
> {{'read.streaming.enabled' = 'true'}}
> The HoodieTableSource announces that it has only an INSERT changelog, 
> but MergeOnReadInputFormat will emit deletes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3616) Investigate mor async compact integ test failure

2023-03-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3616:
-
Fix Version/s: 0.14.0
   (was: 0.13.1)

> Investigate mor async compact integ test failure
> 
>
> Key: HUDI-3616
> URL: https://issues.apache.org/jira/browse/HUDI-3616
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: tests-ci
>Reporter: sivabalan narayanan
>Priority: Minor
> Fix For: 0.14.0
>
>
> mor async compact integ test validation is failing. 
>  
> {code:java}
> 22/03/14 01:31:28 WARN DagNode: Validation using data from input path 
> /home/hadoop/staging/input//*/*
> 266722/03/14 01:31:28 INFO ValidateDatasetNode: Validate data in target hudi 
> path /home/hadoop/staging/output//*/*/*
> 266822/03/14 01:31:31 ERROR DagNode: Data set validation failed. Total count 
> in hudi 64400, input df count 64400
> 266922/03/14 01:31:31 INFO DagScheduler: Forcing shutdown of executor 
> service, this might kill running tasks
> 267022/03/14 01:31:31 ERROR HoodieTestSuiteJob: Failed to run Test Suite 
> 2671java.util.concurrent.ExecutionException: java.lang.AssertionError: Hudi 
> contents does not match contents input data. 
> 2672at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> 2673at java.util.concurrent.FutureTask.get(FutureTask.java:206)
> 2674at 
> org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.execute(DagScheduler.java:113)
> 2675at 
> org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.schedule(DagScheduler.java:68)
> 2676at 
> org.apache.hudi.integ.testsuite.HoodieTestSuiteJob.runTestSuite(HoodieTestSuiteJob.java:203)
> 2677at 
> org.apache.hudi.integ.testsuite.HoodieTestSuiteJob.main(HoodieTestSuiteJob.java:170)
> 2678at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2679at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2680at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2681at java.lang.reflect.Method.invoke(Method.java:498)
> 2682at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
> 2683at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
> 2684at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
> 2685at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
> 2686at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
> 2687at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
> 2688at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
> 2689at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> 2690Caused by: java.lang.AssertionError: Hudi contents does not match 
> contents input data. 
> 2691at 
> org.apache.hudi.integ.testsuite.dag.nodes.BaseValidateDatasetNode.execute(BaseValidateDatasetNode.java:119)
> 2692at 
> org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.executeNode(DagScheduler.java:139)
> 2693at 
> org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.lambda$execute$0(DagScheduler.java:105)
> 2694at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> 2695at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 2696at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 2697at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 2698at java.lang.Thread.run(Thread.java:748)
> 2699Exception in thread "main" org.apache.hudi.exception.HoodieException: 
> Failed to run Test Suite 
> 2700at 
> org.apache.hudi.integ.testsuite.HoodieTestSuiteJob.runTestSuite(HoodieTestSuiteJob.java:208)
> 2701at 
> org.apache.hudi.integ.testsuite.HoodieTestSuiteJob.main(HoodieTestSuiteJob.java:170)
> 2702at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2703at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2704at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2705at java.lang.reflect.Method.invoke(Method.java:498)
> 2706at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
> 2707at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
> 2708at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
> 2709at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
> 2710at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
> 2711at 
> org.apache.spark.deploy.SparkSubmi

[jira] [Updated] (HUDI-2954) Code cleanup: HFileDataBlock - using integer keys is never used

2023-03-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2954:
-
Fix Version/s: 0.14.0
   (was: 0.13.1)

> Code cleanup: HFileDataBlock - using integer keys is never used 
> ---
>
> Key: HUDI-2954
> URL: https://issues.apache.org/jira/browse/HUDI-2954
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: code-quality, metadata
>Reporter: Manoj Govindassamy
>Assignee: Ethan Guo
>Priority: Minor
> Fix For: 0.14.0
>
>
>  
> The key field can never be empty for HFile. If so, there is really no need to 
> fall back to sequential integer keys in the 
> HFileDataBlock::serializeRecords() code path.
>  
> {noformat}
> // Build the record key
> final Field schemaKeyField = 
> records.get(0).getSchema().getField(this.keyField);
> if (schemaKeyField == null) {
>   // Missing key metadata field. Use an integer sequence key instead.
>   useIntegerKey = true;
>   keySize = (int) Math.ceil(Math.log(records.size())) + 1;
> }
> while (itr.hasNext()) {
>   IndexedRecord record = itr.next();
>   String recordKey;
>   if (useIntegerKey) {
> recordKey = String.format("%" + keySize + "s", key++);
>   } else {
> recordKey = record.get(schemaKeyField.pos()).toString();
>   }
> {noformat}
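
An illustrative sketch of the proposed cleanup, always deriving the record key from the key field with no integer-sequence fallback; the names below are assumptions, not the actual HFileDataBlock code.

{code:java}
import org.apache.avro.Schema;
import org.apache.avro.generic.IndexedRecord;

final class RecordKeys {
  // Fail fast if the key field is missing instead of falling back to integer keys.
  static String recordKey(IndexedRecord record, String keyFieldName) {
    Schema.Field keyField = record.getSchema().getField(keyFieldName);
    if (keyField == null) {
      throw new IllegalStateException("Missing record key field: " + keyFieldName);
    }
    return record.get(keyField.pos()).toString();
  }
}
{code}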



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5824) COMBINE_BEFORE_UPSERT=false option does not work for upsert

2023-03-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5824:
-
Fix Version/s: 0.12.3

> COMBINE_BEFORE_UPSERT=false option does not work for upsert
> ---
>
> Key: HUDI-5824
> URL: https://issues.apache.org/jira/browse/HUDI-5824
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Affects Versions: 0.12.1, 0.12.2, 0.13.0
>Reporter: kazdy
>Assignee: kazdy
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.13.1, 0.12.3
>
>
> hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
> shouldCombine does not take into account the situation where the write 
> operation is UPSERT but COMBINE_BEFORE_UPSERT is false.
> Currently, Hudi always combines records on UPSERT, and option 
> COMBINE_BEFORE_UPSERT is not honored.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3646) The Hudi update syntax should not modify the nullability attribute of a column

2023-03-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3646:
-
Priority: Critical  (was: Minor)

> The Hudi update syntax should not modify the nullability attribute of a column
> --
>
> Key: HUDI-3646
> URL: https://issues.apache.org/jira/browse/HUDI-3646
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Affects Versions: 0.10.1
> Environment: spark3.1.2
>Reporter: Tao Meng
>Assignee: Alexey Kudinkin
>Priority: Critical
> Fix For: 0.13.1
>
>
> Now, when we use Spark SQL to update a Hudi table, we find that Hudi will 
> change the nullability attribute of a column.
> For example:
> {code:java}
> // code placeholder
>  val tableName = generateTableName
>  val tablePath = s"${new Path(tmp.getCanonicalPath, 
> tableName).toUri.toString}"
>  // create table
>  spark.sql(
>s"""
>   |create table $tableName (
>   |  id int,
>   |  name string,
>   |  price double,
>   |  ts long
>   |) using hudi
>   | location '$tablePath'
>   | options (
>   |  type = '$tableType',
>   |  primaryKey = 'id',
>   |  preCombineField = 'ts'
>   | )
> """.stripMargin)
>  // insert data to table
>  spark.sql(s"insert into $tableName select 1, 'a1', 10, 1000")
>  spark.sql(s"select * from $tableName").printSchema()
>  // update data
>  spark.sql(s"update $tableName set price = 20 where id = 1")
>  spark.sql(s"select * from $tableName").printSchema() {code}
>  
>  |-- _hoodie_commit_time: string (nullable = true)
>  |-- _hoodie_commit_seqno: string (nullable = true)
>  |-- _hoodie_record_key: string (nullable = true)
>  |-- _hoodie_partition_path: string (nullable = true)
>  |-- _hoodie_file_name: string (nullable = true)
>  |-- id: integer (nullable = true)
>  |-- name: string (nullable = true)
>  *|-- price: double (nullable = true)*
>  |-- ts: long (nullable = true)
>  
>  |-- _hoodie_commit_time: string (nullable = true)
>  |-- _hoodie_commit_seqno: string (nullable = true)
>  |-- _hoodie_record_key: string (nullable = true)
>  |-- _hoodie_partition_path: string (nullable = true)
>  |-- _hoodie_file_name: string (nullable = true)
>  |-- id: integer (nullable = true)
>  |-- name: string (nullable = true)
>  *|-- price: double (nullable = false )*
>  |-- ts: long (nullable = true)
>  
> The nullable attribute of price has been changed to false. This is not the 
> result we want



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3646) The Hudi update syntax should not modify the nullability attribute of a column

2023-03-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3646:
-
Fix Version/s: 0.12.3

> The Hudi update syntax should not modify the nullability attribute of a column
> --
>
> Key: HUDI-3646
> URL: https://issues.apache.org/jira/browse/HUDI-3646
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Affects Versions: 0.10.1
> Environment: spark3.1.2
>Reporter: Tao Meng
>Assignee: Alexey Kudinkin
>Priority: Critical
> Fix For: 0.13.1, 0.12.3
>
>
> Now, when we use Spark SQL to update a Hudi table, we find that Hudi will 
> change the nullability attribute of a column.
> For example:
> {code:java}
> // code placeholder
>  val tableName = generateTableName
>  val tablePath = s"${new Path(tmp.getCanonicalPath, 
> tableName).toUri.toString}"
>  // create table
>  spark.sql(
>s"""
>   |create table $tableName (
>   |  id int,
>   |  name string,
>   |  price double,
>   |  ts long
>   |) using hudi
>   | location '$tablePath'
>   | options (
>   |  type = '$tableType',
>   |  primaryKey = 'id',
>   |  preCombineField = 'ts'
>   | )
> """.stripMargin)
>  // insert data to table
>  spark.sql(s"insert into $tableName select 1, 'a1', 10, 1000")
>  spark.sql(s"select * from $tableName").printSchema()
>  // update data
>  spark.sql(s"update $tableName set price = 20 where id = 1")
>  spark.sql(s"select * from $tableName").printSchema() {code}
>  
>  |-- _hoodie_commit_time: string (nullable = true)
>  |-- _hoodie_commit_seqno: string (nullable = true)
>  |-- _hoodie_record_key: string (nullable = true)
>  |-- _hoodie_partition_path: string (nullable = true)
>  |-- _hoodie_file_name: string (nullable = true)
>  |-- id: integer (nullable = true)
>  |-- name: string (nullable = true)
>  *|-- price: double (nullable = true)*
>  |-- ts: long (nullable = true)
>  
>  |-- _hoodie_commit_time: string (nullable = true)
>  |-- _hoodie_commit_seqno: string (nullable = true)
>  |-- _hoodie_record_key: string (nullable = true)
>  |-- _hoodie_partition_path: string (nullable = true)
>  |-- _hoodie_file_name: string (nullable = true)
>  |-- id: integer (nullable = true)
>  |-- name: string (nullable = true)
>  *|-- price: double (nullable = false )*
>  |-- ts: long (nullable = true)
>  
> The nullable attribute of price has been changed to false. This is not the 
> result we want



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5292) Exclude the test resources from every module packaging

2023-03-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5292:
-
Priority: Major  (was: Critical)

> Exclude the test resources from every module packaging
> --
>
> Key: HUDI-5292
> URL: https://issues.apache.org/jira/browse/HUDI-5292
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: dependencies
>Reporter: Sagar Sumit
>Priority: Major
> Fix For: 0.13.1, 0.12.3
>
>
> Exclude the test resources, especially the properties files that conflict 
> with user-provided resources, from every module. This is a followup to 
> https://github.com/apache/hudi/pull/7310#issuecomment-1328728297



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5292) Exclude the test resources from every module packaging

2023-03-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5292:
-
Fix Version/s: 0.12.3

> Exclude the test resources from every module packaging
> --
>
> Key: HUDI-5292
> URL: https://issues.apache.org/jira/browse/HUDI-5292
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: dependencies
>Reporter: Sagar Sumit
>Priority: Critical
> Fix For: 0.13.1, 0.12.3
>
>
> Exclude the test resources, especially the properties files that conflict 
> with user-provided resources, from every module. This is a followup to 
> https://github.com/apache/hudi/pull/7310#issuecomment-1328728297



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5292) Exclude the test resources from every module packaging

2023-03-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5292:
-
Priority: Critical  (was: Major)

> Exclude the test resources from every module packaging
> --
>
> Key: HUDI-5292
> URL: https://issues.apache.org/jira/browse/HUDI-5292
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Sagar Sumit
>Priority: Critical
> Fix For: 0.13.1
>
>
> Exclude the test resources, especially the properties files that conflict 
> with user-provided resources, from every module. This is a followup to 
> https://github.com/apache/hudi/pull/7310#issuecomment-1328728297



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5292) Exclude the test resources from every module packaging

2023-03-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5292:
-
Component/s: dependencies

> Exclude the test resources from every module packaging
> --
>
> Key: HUDI-5292
> URL: https://issues.apache.org/jira/browse/HUDI-5292
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: dependencies
>Reporter: Sagar Sumit
>Priority: Critical
> Fix For: 0.13.1
>
>
> Exclude the test resources, especially the properties files that conflict 
> with user-provided resources, from every module. This is a followup to 
> https://github.com/apache/hudi/pull/7310#issuecomment-1328728297



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5037) Upgrade libthrift in integ-test-bundle

2023-03-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5037:
-
Fix Version/s: 0.13.0
   (was: 0.13.1)

> Upgrade libthrift in integ-test-bundle
> --
>
> Key: HUDI-5037
> URL: https://issues.apache.org/jira/browse/HUDI-5037
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-5037) Upgrade libthrift in integ-test-bundle

2023-03-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu closed HUDI-5037.

Resolution: Fixed

> Upgrade libthrift in integ-test-bundle
> --
>
> Key: HUDI-5037
> URL: https://issues.apache.org/jira/browse/HUDI-5037
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] nsivabalan commented on a diff in pull request #8107: [HUDI-5514] Adding auto generation of record keys support to Hudi

2023-03-09 Thread via GitHub


nsivabalan commented on code in PR #8107:
URL: https://github.com/apache/hudi/pull/8107#discussion_r1131853630


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/AutoRecordKeyGenerationUtils.scala:
##
@@ -0,0 +1,103 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.avro.generic.GenericRecord
+import org.apache.hudi.DataSourceWriteOptions.INSERT_DROP_DUPS
+import org.apache.hudi.common.config.HoodieConfig
+import org.apache.hudi.common.model.{HoodieRecord, WriteOperationType}
+import org.apache.hudi.common.table.HoodieTableConfig
+import org.apache.hudi.config.HoodieWriteConfig
+import org.apache.hudi.exception.HoodieException
+import org.apache.spark.TaskContext
+
+object AutoRecordKeyGenerationUtils {
+
+   // supported operation types when auto generation of record keys is enabled.
+   val supportedOperations: Set[String] =
+Set(WriteOperationType.INSERT, WriteOperationType.BULK_INSERT, 
WriteOperationType.DELETE,

Review Comment:
   nope. its feasible via spark-sql. will tackle this in phase 2
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4557) Support validation of column stats of avro log files in tests

2023-03-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4557:
-
Fix Version/s: 0.12.3

> Support validation of column stats of avro log files in tests
> -
>
> Key: HUDI-4557
> URL: https://issues.apache.org/jira/browse/HUDI-4557
> Project: Apache Hudi
>  Issue Type: Test
>  Components: tests-ci
>Reporter: Ethan Guo
>Priority: Critical
> Fix For: 0.13.1, 0.12.3
>
>
> In TestColumnStatsIndex, when comparing the column stats with the actual data 
> files, only parquet files are supported.  We need to support avro log files 
> as well.  Note that, to validate the column stat of avro log files, we use 
> resource files storing the expected column stat table content for validation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4557) Support validation of column stats of avro log files in tests

2023-03-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4557:
-
Priority: Critical  (was: Major)

> Support validation of column stats of avro log files in tests
> -
>
> Key: HUDI-4557
> URL: https://issues.apache.org/jira/browse/HUDI-4557
> Project: Apache Hudi
>  Issue Type: Test
>  Components: tests-ci
>Reporter: Ethan Guo
>Priority: Critical
> Fix For: 0.13.1
>
>
> In TestColumnStatsIndex, when comparing the column stats with the actual data 
> files, only parquet files are supported.  We need to support avro log files 
> as well.  Note that, to validate the column stat of avro log files, we use 
> resource files storing the expected column stat table content for validation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4557) Support validation of column stats of avro log files in tests

2023-03-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4557:
-
Issue Type: Test  (was: Improvement)

> Support validation of column stats of avro log files in tests
> -
>
> Key: HUDI-4557
> URL: https://issues.apache.org/jira/browse/HUDI-4557
> Project: Apache Hudi
>  Issue Type: Test
>  Components: tests-ci
>Reporter: Ethan Guo
>Priority: Major
> Fix For: 0.13.1
>
>
> In TestColumnStatsIndex, when comparing the column stats with the actual data 
> files, only parquet files are supported.  We need to support avro log files 
> as well.  Note that, to validate the column stat of avro log files, we use 
> resource files storing the expected column stat table content for validation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2782) Fix marker based strategy for structured streaming

2023-03-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2782:
-
Issue Type: Bug  (was: Improvement)

> Fix marker based strategy for structured streaming
> --
>
> Key: HUDI-2782
> URL: https://issues.apache.org/jira/browse/HUDI-2782
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: sivabalan narayanan
>Priority: Major
> Fix For: 0.13.1
>
>
> As part of [this|https://github.com/apache/hudi/pull/3967] patch, we are 
> making the timeline-server-based marker type the default. But we have an issue 
> with structured streaming: it looks like after the first micro batch the timeline 
> server gets shut down, and for subsequent micro batches the timeline server is not 
> available. So, in the patch we override the marker type just for 
> structured streaming. 
>  
> We may want to revisit this and see how to go about it. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2782) Fix marker based strategy for structured streaming

2023-03-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2782:
-
Fix Version/s: 0.12.3

> Fix marker based strategy for structured streaming
> --
>
> Key: HUDI-2782
> URL: https://issues.apache.org/jira/browse/HUDI-2782
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: sivabalan narayanan
>Priority: Major
> Fix For: 0.13.1, 0.12.3
>
>
> As part of [this|https://github.com/apache/hudi/pull/3967] patch, we are 
> making the timeline-server-based marker type the default. But we have an issue 
> with structured streaming: it looks like after the first micro batch the timeline 
> server gets shut down, and for subsequent micro batches the timeline server is not 
> available. So, in the patch we override the marker type just for 
> structured streaming. 
>  
> We may want to revisit this and see how to go about it. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2506) Hudi dependency governance

2023-03-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2506:
-
Fix Version/s: 0.14.0

> Hudi dependency governance
> --
>
> Key: HUDI-2506
> URL: https://issues.apache.org/jira/browse/HUDI-2506
> Project: Apache Hudi
>  Issue Type: Test
>  Components: dependencies, Usability
>Reporter: vinoyang
>Assignee: Lokesh Jain
>Priority: Critical
> Fix For: 0.13.1, 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2782) Fix marker based strategy for structured streaming

2023-03-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2782:
-
Priority: Critical  (was: Major)

> Fix marker based strategy for structured streaming
> --
>
> Key: HUDI-2782
> URL: https://issues.apache.org/jira/browse/HUDI-2782
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: sivabalan narayanan
>Priority: Critical
> Fix For: 0.13.1, 0.12.3
>
>
> As part of [this|https://github.com/apache/hudi/pull/3967] patch, we are 
> making the timeline-server-based marker type the default. But we have an issue 
> with structured streaming: it looks like after the first micro batch the timeline 
> server gets shut down, and for subsequent micro batches the timeline server is not 
> available. So, in the patch we override the marker type just for 
> structured streaming. 
>  
> We may want to revisit this and see how to go about it. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2506) Hudi dependency governance

2023-03-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2506:
-
Issue Type: Test  (was: Improvement)

> Hudi dependency governance
> --
>
> Key: HUDI-2506
> URL: https://issues.apache.org/jira/browse/HUDI-2506
> Project: Apache Hudi
>  Issue Type: Test
>  Components: dependencies, Usability
>Reporter: vinoyang
>Assignee: Lokesh Jain
>Priority: Major
> Fix For: 0.13.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2506) Hudi dependency governance

2023-03-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2506:
-
Priority: Critical  (was: Major)

> Hudi dependency governance
> --
>
> Key: HUDI-2506
> URL: https://issues.apache.org/jira/browse/HUDI-2506
> Project: Apache Hudi
>  Issue Type: Test
>  Components: dependencies, Usability
>Reporter: vinoyang
>Assignee: Lokesh Jain
>Priority: Critical
> Fix For: 0.13.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5721) Add Github actions on more validations

2023-03-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5721:
-
Priority: Blocker  (was: Critical)

> Add Github actions on more validations
> --
>
> Key: HUDI-5721
> URL: https://issues.apache.org/jira/browse/HUDI-5721
> Project: Apache Hudi
>  Issue Type: Test
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.1, 0.12.3
>
>
> Add the following validation from source release validation to Github actions:
>  * Binary files should not be present
>  * DISCLAIMER file should not be present
>  * LICENSE and NOTICE should exist
>  * Licensing check
>  * RAT check



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5721) Add Github actions on more validations

2023-03-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5721:
-
Fix Version/s: 0.12.3

> Add Github actions on more validations
> --
>
> Key: HUDI-5721
> URL: https://issues.apache.org/jira/browse/HUDI-5721
> Project: Apache Hudi
>  Issue Type: Test
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.1, 0.12.3
>
>
> Add the following validation from source release validation to Github actions:
>  * Binary files should not be present
>  * DISCLAIMER file should not be present
>  * LICENSE and NOTICE should exist
>  * Licensing check
>  * RAT check



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-5794) Fail any new commits if there is any inflight restore in timeline

2023-03-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu closed HUDI-5794.

Resolution: Fixed

> Fail any new commits if there is any inflight restore in timeline
> -
>
> Key: HUDI-5794
> URL: https://issues.apache.org/jira/browse/HUDI-5794
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.1, 0.12.3
>
>
> If a restore failed mid-way, users should not be allowed to start new commits. 
> Let's add a guard rail around that. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5794) Fail any new commits if there is any inflight restore in timeline

2023-03-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5794:
-
Fix Version/s: 0.12.3

> Fail any new commits if there is any inflight restore in timeline
> -
>
> Key: HUDI-5794
> URL: https://issues.apache.org/jira/browse/HUDI-5794
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.1, 0.12.3
>
>
> If restore failed mid-way, users should not be allowed to start new commits. 
> Let's add a guard rail around that. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5721) Add Github actions on more validations

2023-03-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5721:
-
Issue Type: Test  (was: Improvement)

> Add Github actions on more validations
> --
>
> Key: HUDI-5721
> URL: https://issues.apache.org/jira/browse/HUDI-5721
> Project: Apache Hudi
>  Issue Type: Test
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.1
>
>
> Add the following validation from source release validation to Github actions:
>  * Binary files should not be present
>  * DISCLAIMER file should not be present
>  * LICENSE and NOTICE should exist
>  * Licensing check
>  * RAT check



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5612) Integrate metadata table with SpillableMapBasedFileSystemView and RocksDbBasedFileSystemView

2023-03-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5612:
-
Component/s: metadata

> Integrate metadata table with SpillableMapBasedFileSystemView and 
> RocksDbBasedFileSystemView
> 
>
> Key: HUDI-5612
> URL: https://issues.apache.org/jira/browse/HUDI-5612
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata
>Reporter: Ethan Guo
>Priority: Critical
> Fix For: 0.13.1
>
>
> Currently, metadata-table-based file listing is integrated through 
> HoodieMetadataFileSystemView.  SpillableMapBasedFileSystemView (storage type 
> of SPILLABLE_DISK) and RocksDbBasedFileSystemView (storage type of 
> EMBEDDED_KV_STORE) are independent of HoodieMetadataFileSystemView, and these 
> two file system views cannot leverage metadata-table-based file listing.
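
For context, a hedged sketch of how a reader selects these pieces today via the
standard config keys (key names assumed from Hudi's HoodieMetadataConfig and
FileSystemViewStorageConfig; the table path is made up, and whether both options
take effect together on this path is exactly the gap this issue describes):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().master("local[1]").appName("mdt-view").getOrCreate()
  val basePath = "/tmp/hudi_table"  // assumed path to an existing Hudi table

  val df = spark.read.format("hudi")
    .option("hoodie.metadata.enable", "true")                 // metadata-table-based file listing
    .option("hoodie.filesystem.view.type", "SPILLABLE_DISK")  // SpillableMapBasedFileSystemView
    .load(basePath)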



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5611) Revisit metadata-table-based file listing calls and use batch lookup instead

2023-03-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-5611:
-
Component/s: metadata

> Revisit metadata-table-based file listing calls and use batch lookup instead
> 
>
> Key: HUDI-5611
> URL: https://issues.apache.org/jira/browse/HUDI-5611
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata
>Reporter: Ethan Guo
>Priority: Critical
> Fix For: 0.13.1
>
>
> We discovered a performance issue with savepoint when the metadata table is 
> enabled, caused by unnecessary scanning of the metadata table when the 
> number of partitions is large: the savepoint operation scans the metadata 
> table once per partition, which leads to a lot of S3 requests.  The solution 
> is to batch the list calls for all partitions (HUDI-5485).
>  
> We need to revisit metadata-table-based file listing calls in a similar 
> fashion and replace them with batch lookup if needed.
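
A generic sketch of the batching pattern being asked for (this is not Hudi API;
the trait below is only a stand-in for the metadata table reader):

  trait FileLister {
    def listPartition(partition: String): Seq[String]                      // one scan per call
    def listPartitions(partitions: Seq[String]): Map[String, Seq[String]]  // single batched scan
  }

  // N partitions => N metadata-table scans (the pattern this issue wants to remove)
  def listNaively(lister: FileLister, partitions: Seq[String]): Map[String, Seq[String]] =
    partitions.map(p => p -> lister.listPartition(p)).toMap

  // N partitions => one batched lookup, as HUDI-5485 did for the savepoint path
  def listBatched(lister: FileLister, partitions: Seq[String]): Map[String, Seq[String]] =
    lister.listPartitions(partitions)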



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] nsivabalan commented on a diff in pull request #8107: [HUDI-5514] Adding auto generation of record keys support to Hudi

2023-03-09 Thread via GitHub


nsivabalan commented on code in PR #8107:
URL: https://github.com/apache/hudi/pull/8107#discussion_r1131847074


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestAutoGenerationOfRecordKeys.scala:
##
@@ -0,0 +1,282 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.functional
+
+import org.apache.hadoop.fs.FileSystem
+import org.apache.hudi.HoodieConversionUtils.toJavaOption
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.model.{HoodieRecord, HoodieTableType, 
WriteOperationType}
+import org.apache.hudi.common.model.HoodieRecord.HoodieRecordType
+import org.apache.hudi.common.table.HoodieTableConfig
+import org.apache.hudi.common.testutils.RawTripTestPayload.recordsToStrings
+import org.apache.hudi.common.util
+import org.apache.hudi.config.HoodieWriteConfig
+import org.apache.hudi.exception.ExceptionUtil.getRootCause
+import org.apache.hudi.exception.HoodieException
+import org.apache.hudi.functional.CommonOptionUtils._
+import org.apache.hudi.keygen.constant.KeyGeneratorOptions
+import org.apache.hudi.keygen.{ComplexKeyGenerator, 
NonpartitionedKeyGenerator, SimpleKeyGenerator, TimestampBasedKeyGenerator}
+import org.apache.hudi.keygen.constant.KeyGeneratorOptions.Config
+import org.apache.hudi.testutils.HoodieSparkClientTestBase
+import org.apache.hudi.util.JFunction
+import org.apache.hudi.{DataSourceWriteOptions, HoodieDataSourceHelpers, 
ScalaAssertionSupport}
+import org.apache.spark.sql.hudi.HoodieSparkSessionExtension
+import org.apache.spark.sql.{SaveMode, SparkSession, SparkSessionExtensions}
+import org.junit.jupiter.api.Assertions.{assertEquals, assertTrue}
+import org.junit.jupiter.api.{AfterEach, BeforeEach, Test}
+import org.junit.jupiter.params.ParameterizedTest
+import org.junit.jupiter.params.provider.{CsvSource, EnumSource}
+
+import java.util.function.Consumer
+import scala.collection.JavaConversions._
+import scala.collection.JavaConverters._
+
+class TestAutoGenerationOfRecordKeys extends HoodieSparkClientTestBase with 
ScalaAssertionSupport {
+  var spark: SparkSession = null

Review Comment:
   This will be set in the BeforeEach method. We don't have any code paths where 
this might be null. I don't think we need to add Option here. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a diff in pull request #8107: [HUDI-5514] Adding auto generation of record keys support to Hudi

2023-03-09 Thread via GitHub


nsivabalan commented on code in PR #8107:
URL: https://github.com/apache/hudi/pull/8107#discussion_r1131845834


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala:
##
@@ -1096,31 +1104,47 @@ object HoodieSparkSqlWriter {
   Some(writerSchema))
 
 avroRecords.mapPartitions(it => {
+  val sparkPartitionId = TaskContext.getPartitionId()
+
   val dataFileSchema = new Schema.Parser().parse(dataFileSchemaStr)
   val consistentLogicalTimestampEnabled = parameters.getOrElse(
 
DataSourceWriteOptions.KEYGENERATOR_CONSISTENT_LOGICAL_TIMESTAMP_ENABLED.key(),
 
DataSourceWriteOptions.KEYGENERATOR_CONSISTENT_LOGICAL_TIMESTAMP_ENABLED.defaultValue()).toBoolean
 
-  it.map { avroRecord =>
+  // generate record keys if auto generation is enabled.
+  val recordsWithRecordKeyOverride = 
mayBeAutoGenerateRecordKeys(autoGenerateRecordKeys, it, instantTime)
+
+  // handle dropping partition columns
+  recordsWithRecordKeyOverride.map { avroRecordRecordKeyOverRide =>
 val processedRecord = if (shouldDropPartitionColumns) {
-  HoodieAvroUtils.rewriteRecord(avroRecord, dataFileSchema)
+  HoodieAvroUtils.rewriteRecord(avroRecordRecordKeyOverRide._1, 
dataFileSchema)
+} else {
+  avroRecordRecordKeyOverRide._1
+}
+
+// Generate HoodieKey for records
+val hoodieKey = if (autoGenerateRecordKeys) {
+  // fetch record key from the recordKeyOverride if auto 
generation is enabled.
+  new HoodieKey(avroRecordRecordKeyOverRide._2.get, 
keyGenerator.getKey(avroRecordRecordKeyOverRide._1).getPartitionPath)

Review Comment:
   Since we have plans to fix this w/ 
   https://github.com/apache/hudi/pull/7699
   HUDI-5535, I don't want to add additional APIs to the base 
interface/abstract class for now. 
   Let's revisit holistically. 
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a diff in pull request #8107: [HUDI-5514] Adding auto generation of record keys support to Hudi

2023-03-09 Thread via GitHub


nsivabalan commented on code in PR #8107:
URL: https://github.com/apache/hudi/pull/8107#discussion_r1131845834


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala:
##
@@ -1096,31 +1104,47 @@ object HoodieSparkSqlWriter {
   Some(writerSchema))
 
 avroRecords.mapPartitions(it => {
+  val sparkPartitionId = TaskContext.getPartitionId()
+
   val dataFileSchema = new Schema.Parser().parse(dataFileSchemaStr)
   val consistentLogicalTimestampEnabled = parameters.getOrElse(
 
DataSourceWriteOptions.KEYGENERATOR_CONSISTENT_LOGICAL_TIMESTAMP_ENABLED.key(),
 
DataSourceWriteOptions.KEYGENERATOR_CONSISTENT_LOGICAL_TIMESTAMP_ENABLED.defaultValue()).toBoolean
 
-  it.map { avroRecord =>
+  // generate record keys if auto generation is enabled.
+  val recordsWithRecordKeyOverride = 
mayBeAutoGenerateRecordKeys(autoGenerateRecordKeys, it, instantTime)
+
+  // handle dropping partition columns
+  recordsWithRecordKeyOverride.map { avroRecordRecordKeyOverRide =>
 val processedRecord = if (shouldDropPartitionColumns) {
-  HoodieAvroUtils.rewriteRecord(avroRecord, dataFileSchema)
+  HoodieAvroUtils.rewriteRecord(avroRecordRecordKeyOverRide._1, 
dataFileSchema)
+} else {
+  avroRecordRecordKeyOverRide._1
+}
+
+// Generate HoodieKey for records
+val hoodieKey = if (autoGenerateRecordKeys) {
+  // fetch record key from the recordKeyOverride if auto 
generation is enabled.
+  new HoodieKey(avroRecordRecordKeyOverRide._2.get, 
keyGenerator.getKey(avroRecordRecordKeyOverRide._1).getPartitionPath)

Review Comment:
   yes. 
   https://github.com/apache/hudi/pull/7699
   HUDI-5535



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a diff in pull request #8107: [HUDI-5514] Adding auto generation of record keys support to Hudi

2023-03-09 Thread via GitHub


nsivabalan commented on code in PR #8107:
URL: https://github.com/apache/hudi/pull/8107#discussion_r1131845254


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/AutoRecordKeyGenerationUtils.scala:
##
@@ -0,0 +1,103 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.avro.generic.GenericRecord
+import org.apache.hudi.DataSourceWriteOptions.INSERT_DROP_DUPS
+import org.apache.hudi.common.config.HoodieConfig
+import org.apache.hudi.common.model.{HoodieRecord, WriteOperationType}
+import org.apache.hudi.common.table.HoodieTableConfig
+import org.apache.hudi.config.HoodieWriteConfig
+import org.apache.hudi.exception.HoodieException
+import org.apache.spark.TaskContext
+
+object AutoRecordKeyGenerationUtils {
+
+   // supported operation types when auto generation of record keys is enabled.
+   val supportedOperations: Set[String] =
+Set(WriteOperationType.INSERT, WriteOperationType.BULK_INSERT, 
WriteOperationType.DELETE,
+  WriteOperationType.INSERT_OVERWRITE, 
WriteOperationType.INSERT_OVERWRITE_TABLE,
+  WriteOperationType.DELETE_PARTITION).map(_.name())
+
+  def validateParamsForAutoGenerationOfRecordKeys(parameters: Map[String, 
String],
+  operation: 
WriteOperationType, hoodieConfig: HoodieConfig): Unit = {
+val autoGenerateRecordKeys: Boolean = 
parameters.getOrElse(HoodieTableConfig.AUTO_GENERATE_RECORD_KEYS.key(),
+  HoodieTableConfig.AUTO_GENERATE_RECORD_KEYS.defaultValue()).toBoolean
+
+if (autoGenerateRecordKeys) {
+  // check for supported operations.
+  if (!supportedOperations.contains(operation.name())) {
+throw new HoodieException(operation.name() + " is not supported with 
Auto generation of record keys. "
+  + "Supported operations are : " + supportedOperations)
+  }
+  // de-dup is not supported with auto generation of record keys
+  if (parameters.getOrElse(HoodieWriteConfig.COMBINE_BEFORE_INSERT.key(),
+HoodieWriteConfig.COMBINE_BEFORE_INSERT.defaultValue()).toBoolean) {
+throw new HoodieException("Enabling " + 
HoodieWriteConfig.COMBINE_BEFORE_INSERT.key() + " is not supported with auto 
generation of record keys ");
+  }
+  // drop dupes is not supported
+  if (hoodieConfig.getBoolean(INSERT_DROP_DUPS)) {
+throw new HoodieException("Enabling " + INSERT_DROP_DUPS.key() + " is 
not supported with auto generation of record keys ");
+  }
+  // virtual keys are not supported with auto generation of record keys.
+  if (!parameters.getOrElse(HoodieTableConfig.POPULATE_META_FIELDS.key(), 
HoodieTableConfig.POPULATE_META_FIELDS.defaultValue().toString).toBoolean) {
+throw new HoodieException("Disabling " + 
HoodieTableConfig.POPULATE_META_FIELDS.key() + " is not supported with auto 
generation of record keys");
+  }
+}
+  }
+
+  /**
+   * Auto Generate record keys when auto generation config is enabled.
+   * 
+   *   Generated keys will be unique not only w/in provided 
[[org.apache.spark.sql.DataFrame]], but
+   *   globally unique w/in the target table
+   *   Generated keys have minimal overhead (to compute, persist and 
read)
+   * 
+   *
+   * Keys adhere to the following format:
+   *
+   * [instantTime]_[PartitionId]_[RowId]
+   *
+   * where
+   * instantTime refers to the commit time of the batch being ingested.
+   * PartitionId refers to spark's partition Id.
+   * RowId refers to the row index within the spark partition.
+   *
+   * @param autoGenerateKeys true if auto generation of record keys is 
enabled. false otherwise.
+   * @param genRecsItr Iterator of GenericRecords.
+   * @param instantTime commit time of the batch.
+   * @return Iterator of Pair of GenericRecord and Optionally generated record 
key.
+   */
+  def mayBeAutoGenerateRecordKeys(autoGenerateKeys : Boolean, genRecsItr: 
Iterator[GenericRecord], instantTime: String): Iterator[(GenericRecord, 
Option[String])] = {
+var rowId = 0
+val sparkPartitionId = TaskContext.getPartitionId()
+
+// we will override record keys if auto generation of keys is enabled.
+genRecsItr.map(avroReco
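
(The quoted diff is cut off by the mail archive above. As a minimal,
illustrative sketch of the key scheme the Scaladoc describes,
[instantTime]_[PartitionId]_[RowId], and not the actual code from PR #8107:)

  // Illustrative only: generate one key per row within a Spark partition.
  def autoGeneratedKeys(instantTime: String, sparkPartitionId: Int, rowCount: Int): Iterator[String] =
    Iterator.range(0, rowCount).map(rowId => s"${instantTime}_${sparkPartitionId}_${rowId}")

  // autoGeneratedKeys("20230309121530", 3, 2).toList
  //   == List("20230309121530_3_0", "20230309121530_3_1")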

[GitHub] [hudi] nsivabalan commented on a diff in pull request #8107: [HUDI-5514] Adding auto generation of record keys support to Hudi

2023-03-09 Thread via GitHub


nsivabalan commented on code in PR #8107:
URL: https://github.com/apache/hudi/pull/8107#discussion_r1131844741


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/AutoRecordKeyGenerationUtils.scala:
##
@@ -0,0 +1,103 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.avro.generic.GenericRecord
+import org.apache.hudi.DataSourceWriteOptions.INSERT_DROP_DUPS
+import org.apache.hudi.common.config.HoodieConfig
+import org.apache.hudi.common.model.{HoodieRecord, WriteOperationType}
+import org.apache.hudi.common.table.HoodieTableConfig
+import org.apache.hudi.config.HoodieWriteConfig
+import org.apache.hudi.exception.HoodieException
+import org.apache.spark.TaskContext
+
+object AutoRecordKeyGenerationUtils {
+
+   // supported operation types when auto generation of record keys is enabled.
+   val supportedOperations: Set[String] =
+Set(WriteOperationType.INSERT, WriteOperationType.BULK_INSERT, 
WriteOperationType.DELETE,
+  WriteOperationType.INSERT_OVERWRITE, 
WriteOperationType.INSERT_OVERWRITE_TABLE,
+  WriteOperationType.DELETE_PARTITION).map(_.name())
+
+  def validateParamsForAutoGenerationOfRecordKeys(parameters: Map[String, 
String],
+  operation: 
WriteOperationType, hoodieConfig: HoodieConfig): Unit = {
+val autoGenerateRecordKeys: Boolean = 
parameters.getOrElse(HoodieTableConfig.AUTO_GENERATE_RECORD_KEYS.key(),
+  HoodieTableConfig.AUTO_GENERATE_RECORD_KEYS.defaultValue()).toBoolean
+
+if (autoGenerateRecordKeys) {
+  // check for supported operations.
+  if (!supportedOperations.contains(operation.name())) {
+throw new HoodieException(operation.name() + " is not supported with 
Auto generation of record keys. "
+  + "Supported operations are : " + supportedOperations)
+  }
+  // de-dup is not supported with auto generation of record keys
+  if (parameters.getOrElse(HoodieWriteConfig.COMBINE_BEFORE_INSERT.key(),
+HoodieWriteConfig.COMBINE_BEFORE_INSERT.defaultValue()).toBoolean) {
+throw new HoodieException("Enabling " + 
HoodieWriteConfig.COMBINE_BEFORE_INSERT.key() + " is not supported with auto 
generation of record keys ");
+  }
+  // drop dupes is not supported
+  if (hoodieConfig.getBoolean(INSERT_DROP_DUPS)) {
+throw new HoodieException("Enabling " + INSERT_DROP_DUPS.key() + " is 
not supported with auto generation of record keys ");
+  }
+  // virtual keys are not supported with auto generation of record keys.
+  if (!parameters.getOrElse(HoodieTableConfig.POPULATE_META_FIELDS.key(), 
HoodieTableConfig.POPULATE_META_FIELDS.defaultValue().toString).toBoolean) {
+throw new HoodieException("Disabling " + 
HoodieTableConfig.POPULATE_META_FIELDS.key() + " is not supported with auto 
generation of record keys");
+  }
+}
+  }
+
+  /**
+   * Auto Generate record keys when auto generation config is enabled.
+   * 
+   *   Generated keys will be unique not only w/in provided 
[[org.apache.spark.sql.DataFrame]], but
+   *   globally unique w/in the target table
+   *   Generated keys have minimal overhead (to compute, persist and 
read)
+   * 
+   *
+   * Keys adhere to the following format:
+   *
+   * [instantTime]_[PartitionId]_[RowId]
+   *
+   * where
+   * instantTime refers to the commit time of the batch being ingested.
+   * PartitionId refers to spark's partition Id.
+   * RowId refers to the row index within the spark partition.
+   *
+   * @param autoGenerateKeys true if auto generation of record keys is 
enabled. false otherwise.
+   * @param genRecsItr Iterator of GenericRecords.
+   * @param instantTime commit time of the batch.
+   * @return Iterator of Pair of GenericRecord and Optionally generated record 
key.
+   */
+  def mayBeAutoGenerateRecordKeys(autoGenerateKeys : Boolean, genRecsItr: 
Iterator[GenericRecord], instantTime: String): Iterator[(GenericRecord, 
Option[String])] = {
+var rowId = 0
+val sparkPartitionId = TaskContext.getPartitionId()
+
+// we will override record keys if auto generation of keys is enabled.
+genRecsItr.map(avroReco

[jira] [Updated] (HUDI-4245) Support nested fields in Column Stats Index

2023-03-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4245:
-
Component/s: metadata

> Support nested fields in Column Stats Index
> ---
>
> Key: HUDI-4245
> URL: https://issues.apache.org/jira/browse/HUDI-4245
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Critical
> Fix For: 0.13.1
>
>
> Currently only root-level fields are supported in the Column Stats Index, 
> but there's no reason we can't support nested fields, given that columnar 
> file formats store nested fields as _nested columns_, i.e. as columns named 
> after the field and the struct it belongs to. 
>  
> For example following schema: 
> {code:java}
> c1: StringType
> c2: StructType(Seq(StructField("foo", StringType))){code}
> Would be stored in Parquet as "c1: string", "c2.foo: string", entailing that 
> Parquet actually already collects statistics for all the nested fields and we 
> just need to make sure we're propagating them into Column Stats Index
>  
> Original GH issue:
> [https://github.com/apache/hudi/issues/5804#issuecomment-1152983029]
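
For illustration (assumed output path and session setup, not from the issue), a
plain Spark job writing the schema above to Parquet; the Parquet footer already
carries min/max statistics for the leaf column "c2.foo", which is what this
issue proposes to surface in the Column Stats Index:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().master("local[1]").appName("nested-cols").getOrCreate()
  import spark.implicits._

  // c1: StringType, c2: StructType(Seq(StructField("foo", StringType)))
  Seq(("a", "x"), ("b", "y"))
    .toDF("c1", "foo")
    .selectExpr("c1", "named_struct('foo', foo) as c2")
    .write.mode("overwrite").parquet("/tmp/nested_col_stats_example")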



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] nsivabalan commented on a diff in pull request #8107: [HUDI-5514] Adding auto generation of record keys support to Hudi

2023-03-09 Thread via GitHub


nsivabalan commented on code in PR #8107:
URL: https://github.com/apache/hudi/pull/8107#discussion_r1131842281


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/AutoRecordKeyGenerationUtils.scala:
##
@@ -0,0 +1,103 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.avro.generic.GenericRecord
+import org.apache.hudi.DataSourceWriteOptions.INSERT_DROP_DUPS
+import org.apache.hudi.common.config.HoodieConfig
+import org.apache.hudi.common.model.{HoodieRecord, WriteOperationType}
+import org.apache.hudi.common.table.HoodieTableConfig
+import org.apache.hudi.config.HoodieWriteConfig
+import org.apache.hudi.exception.HoodieException
+import org.apache.spark.TaskContext
+
+object AutoRecordKeyGenerationUtils {
+
+   // supported operation types when auto generation of record keys is enabled.
+   val supportedOperations: Set[String] =
+Set(WriteOperationType.INSERT, WriteOperationType.BULK_INSERT, 
WriteOperationType.DELETE,
+  WriteOperationType.INSERT_OVERWRITE, 
WriteOperationType.INSERT_OVERWRITE_TABLE,
+  WriteOperationType.DELETE_PARTITION).map(_.name())

Review Comment:
   As called out in the docs, UPDATE and DELETE via spark-sql should be 
supported. That will be phase 2.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


