[jira] [Closed] (HUDI-6022) The method param `instantTime` of org.apache.hudi.table.action.commit.BaseFlinkCommitActionExecutor#handleUpsertPartition is redundant
[ https://issues.apache.org/jira/browse/HUDI-6022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen closed HUDI-6022.
Fix Version/s: 0.14.0
Resolution: Fixed

Fixed via master branch: 9288fdc456f9a4215d32908756a4ddaee18abfc4

> The method param `instantTime` of
> org.apache.hudi.table.action.commit.BaseFlinkCommitActionExecutor#handleUpsertPartition
> is redundant
> --
>
> Key: HUDI-6022
> URL: https://issues.apache.org/jira/browse/HUDI-6022
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Jianhui Dong
> Priority: Major
> Labels: easyfix, pull-request-available
> Fix For: 0.14.0
>
> We have stored the `instantTime` in the superclass BaseActionExecutor, and
> there's no need to keep the method param `instantTime`; it's preferable to
> remove it to make the code cleaner.
> {code:java}
> protected Iterator<List<WriteStatus>> handleUpsertPartition(
>     String instantTime,
>     String partitionPath,
>     String fileIdHint,
>     BucketType bucketType,
>     Iterator<HoodieRecord<T>> recordItr) {
>   try {
>     if (this.writeHandle instanceof HoodieCreateHandle) {
>       // During one checkpoint interval, an insert record could also be updated,
>       // for example, for an operation sequence of a record:
>       //    I, U,   | U, U
>       //    - batch1 - | - batch2 -
>       // the first batch (batch1) operation triggers an INSERT bucket,
>       // the second batch (batch2) tries to reuse the same bucket
>       // and append instead of UPDATE.
>       return handleInsert(fileIdHint, recordItr);
>     } else if (this.writeHandle instanceof HoodieMergeHandle) {
>       return handleUpdate(partitionPath, fileIdHint, recordItr);
>     } else {
>       switch (bucketType) {
>         case INSERT:
>           return handleInsert(fileIdHint, recordItr);
>         case UPDATE:
>           return handleUpdate(partitionPath, fileIdHint, recordItr);
>         default:
>           throw new AssertionError();
>       }
>     }
>   } catch (Throwable t) {
>     String msg = "Error upsetting bucketType " + bucketType + " for partition :" + partitionPath;
>     LOG.error(msg, t);
>     throw new HoodieUpsertException(msg, t);
>   }
> } {code}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
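The refactor itself is simple to picture in isolation: the instant time is already held by the base executor, so subclass methods can read the field instead of taking a parameter. A minimal standalone sketch of that pattern (class and method names here are illustrative stand-ins, not Hudi's actual types):

```java
// Sketch of the HUDI-6022 refactor: the instant time lives in the base
// class, so the subclass method no longer needs it as a parameter.
// These classes are toy stand-ins for illustration only.
abstract class BaseActionExecutorSketch {
  protected final String instantTime; // stored once in the superclass

  BaseActionExecutorSketch(String instantTime) {
    this.instantTime = instantTime;
  }
}

class FlinkCommitExecutorSketch extends BaseActionExecutorSketch {
  FlinkCommitExecutorSketch(String instantTime) {
    super(instantTime);
  }

  // Before: handleUpsertPartition(String instantTime, String partitionPath, ...)
  // After: the redundant parameter is gone; the inherited field is used.
  String handleUpsertPartition(String partitionPath) {
    return "upsert partition " + partitionPath + " at instant " + instantTime;
  }
}

class RefactorDemo {
  public static void main(String[] args) {
    FlinkCommitExecutorSketch exec = new FlinkCommitExecutorSketch("20230403120000");
    System.out.println(exec.handleUpsertPartition("p1"));
  }
}
```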
[hudi] branch master updated (5d5658347ad -> 9288fdc456f)
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

from 5d5658347ad [HUDI-5983] Improve loading data via cloud store incr source (#8290)
add 9288fdc456f [HUDI-6022] Remove redundant method param of BaseFlinkCommitActionExecutor (#8363)

No new revisions were added by this update.

Summary of changes:
.../apache/hudi/table/action/commit/BaseFlinkCommitActionExecutor.java | 2 --
1 file changed, 2 deletions(-)
[GitHub] [hudi] danny0405 merged pull request #8363: [HUDI-6022] Remove redundant method param of BaseFlinkCommitActionExec…
danny0405 merged PR #8363: URL: https://github.com/apache/hudi/pull/8363 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] xiarixiaoyao closed pull request #8322: [WIP]spark should pass InstantRange to incremental query for log files
xiarixiaoyao closed pull request #8322: [WIP]spark should pass InstantRange to incremental query for log files URL: https://github.com/apache/hudi/pull/8322
[GitHub] [hudi] hudi-bot commented on pull request #8375: [MINOR]Remove the redundancy config
hudi-bot commented on PR #8375:
URL: https://github.com/apache/hudi/pull/8375#issuecomment-1495442490

## CI report:

* 3a3da94e83aa8d193a7a7351e4c113999a8197b0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16112)

Bot commands: @hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #8351: [HUDI-6013] Support database name for meta sync in bootstrap
hudi-bot commented on PR #8351:
URL: https://github.com/apache/hudi/pull/8351#issuecomment-1495442311

## CI report:

* 62cce26c004b5dabd45271bda4141a730ddad6cb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16052)
* 1b46aa826f2f6733595fa26461aa5fa2ef00199d UNKNOWN
[GitHub] [hudi] danny0405 commented on issue #8366: [SUPPORT] Flink streaming write to Hudi table using data stream API java.lang.NoClassDefFoundError:
danny0405 commented on issue #8366: URL: https://github.com/apache/hudi/issues/8366#issuecomment-1495439988

Which class is missing here?
[jira] [Updated] (HUDI-5955) Incremental clean does not work with archived commits
[ https://issues.apache.org/jira/browse/HUDI-5955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen updated HUDI-5955:
Summary: Incremental clean does not work with archived commits (was: fix incremental clean not work cause by archive)

> Incremental clean does not work with archived commits
> --
>
> Key: HUDI-5955
> URL: https://issues.apache.org/jira/browse/HUDI-5955
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: HBG
> Priority: Major
> Labels: pull-request-available
[GitHub] [hudi] danny0405 commented on a diff in pull request #8373: [HUDI-5955] Incremental clean does not work with archived commits
danny0405 commented on code in PR #8373:
URL: https://github.com/apache/hudi/pull/8373#discussion_r1156802426

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java:

@@ -165,7 +165,8 @@
   HoodieCleanMetadata cleanMetadata = TimelineMetadataUtils
       .deserializeHoodieCleanMetadata(hoodieTable.getActiveTimeline().getInstantDetails(lastClean.get()).get());
   if ((cleanMetadata.getEarliestCommitToRetain() != null)
-      && (cleanMetadata.getEarliestCommitToRetain().length() > 0)) {
+      && (cleanMetadata.getEarliestCommitToRetain().length() > 0)
+      && !hoodieTable.getActiveTimeline().isBeforeTimelineStarts(cleanMetadata.getEarliestCommitToRetain())) {
     return getPartitionPathsForIncrementalCleaning(cleanMetadata, instantToRetain);

Review Comment: Nice catch, can we write a UT if possible?
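The guard being added above only allows incremental cleaning while the previously recorded "earliest commit to retain" is still on the active timeline; once it has been archived, the planner must fall back to a full scan. A toy model of that decision, with illustrative names rather than Hudi's actual API:

```java
import java.util.List;

// Toy model of the HUDI-5955 guard. Instant times are compared as strings,
// mirroring Hudi's lexicographically ordered timestamps. Names are
// illustrative stand-ins, not the real CleanPlanner/HoodieTimeline API.
class CleanPlannerSketch {
  private final List<String> activeInstants; // sorted instants still on the active timeline

  CleanPlannerSketch(List<String> activeInstants) {
    this.activeInstants = activeInstants;
  }

  // Rough analogue of isBeforeTimelineStarts: true when the instant is
  // older than everything still on the active timeline (i.e. archived).
  boolean isBeforeTimelineStarts(String instant) {
    return !activeInstants.isEmpty() && instant.compareTo(activeInstants.get(0)) < 0;
  }

  String choosePlanMode(String earliestCommitToRetain) {
    if (earliestCommitToRetain != null
        && !earliestCommitToRetain.isEmpty()
        && !isBeforeTimelineStarts(earliestCommitToRetain)) {
      return "incremental";
    }
    return "full-scan"; // commit archived (or never recorded): incremental clean is unsafe
  }
}

class CleanPlanDemo {
  public static void main(String[] args) {
    CleanPlannerSketch planner = new CleanPlannerSketch(List.of("0005", "0006", "0007"));
    System.out.println(planner.choosePlanMode("0006")); // still active
    System.out.println(planner.choosePlanMode("0003")); // archived
  }
}
```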
[GitHub] [hudi] hudi-bot commented on pull request #8375: [MINOR]Remove the redundancy config
hudi-bot commented on PR #8375:
URL: https://github.com/apache/hudi/pull/8375#issuecomment-1495434897

## CI report:

* 3a3da94e83aa8d193a7a7351e4c113999a8197b0 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #8102: [HUDI-5880] Support partition pruning for flink streaming source in runtime
hudi-bot commented on PR #8102:
URL: https://github.com/apache/hudi/pull/8102#issuecomment-1495434217

## CI report:

* a66c8ec83a1a8e75d1e28c3e7444b7c3306049a6 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16106)
[GitHub] [hudi] danny0405 commented on issue #8371: [SUPPORT] Flink cant read metafield '_hoodie_commit_time'
danny0405 commented on issue #8371: URL: https://github.com/apache/hudi/issues/8371#issuecomment-1495432965

Seems like a bug, could you file a PR and fix it?
[GitHub] [hudi] nsivabalan commented on a diff in pull request #8344: [HUDI-5968] Fix global index duplicate when update partition
nsivabalan commented on code in PR #8344:
URL: https://github.com/apache/hudi/pull/8344#discussion_r1156783371

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java:

@@ -168,4 +171,36 @@
   }
   return foundRecordKeys;
 }

+  public static <R> HoodieData<HoodieRecord<R>> dedupForPartitionUpdates(
+      HoodieData<Pair<HoodieRecord<R>, Boolean>> taggedHoodieRecords, int dedupParallelism) {
+    /*
+     * In case a record is updated from p1 to p2 and then to p3, 2 existing records
+     * will be tagged for the incoming record to insert to p3. So we dedup them here. (Set A)
+     */
+    HoodiePairData<String, HoodieRecord<R>> deduped = taggedHoodieRecords.filter(Pair::getRight)
+        .map(Pair::getLeft)
+        .distinctWithKey(HoodieRecord::getKey, dedupParallelism)
+        .mapToPair(r -> Pair.of(r.getRecordKey(), r));
+
+    /*
+     * This includes
+     * - tagged existing records whose partition paths are not to be updated (Set B)
+     * - completely new records (Set C)
+     */
+    HoodieData<HoodieRecord<R>> undeduped = taggedHoodieRecords.filter(p -> !p.getRight()).map(Pair::getLeft);
+
+    /*
+     * There can be intersection between Set A and Set B mentioned above.
+     *
+     * Example: record X is updated from p1 to p2 and then back to p1.
+     * Set A will contain an insert to p1 and Set B will contain an update to p1.
+     *
+     * So we let A left-anti join B to drop the insert from Set A and keep the update in Set B.
+     */
+    return deduped.leftOuterJoin(undeduped
+        .filter(r -> !(r.getData() instanceof EmptyHoodieRecordPayload))

Review Comment: synced up directly. Let's add javadocs to call this out, i.e. why we should strictly favor the update record and not the insert, so that anyone looking to make changes in this code block is aware of all the nuances.
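The set logic under discussion is easier to follow with plain collections: partition-update inserts (Set A) are deduped by key, then dropped whenever the same key already appears among the pass-through records (the left-anti join against Set B). A minimal sketch with simplified stand-in types (the real code operates on `HoodieData` with Spark-style transforms):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified stand-in for a tagged record; not Hudi's actual types.
record TaggedRecord(String key, String partition, boolean isPartitionUpdateInsert) {}

class DedupSketch {
  // Toy version of the HUDI-5968 dedup: keep Sets B and C as-is, dedupe
  // the partition-update inserts (Set A) by key, and drop an insert when
  // an update for the same key already exists (A left-anti join B).
  static List<TaggedRecord> dedupForPartitionUpdates(List<TaggedRecord> tagged) {
    List<TaggedRecord> kept = new ArrayList<>();
    Set<String> keptKeys = new HashSet<>();
    for (TaggedRecord r : tagged) {          // Sets B and C pass through unchanged
      if (!r.isPartitionUpdateInsert()) {
        kept.add(r);
        keptKeys.add(r.key());
      }
    }
    Map<String, TaggedRecord> inserts = new LinkedHashMap<>(); // Set A, deduped by key
    for (TaggedRecord r : tagged) {
      if (r.isPartitionUpdateInsert() && !keptKeys.contains(r.key())) {
        inserts.putIfAbsent(r.key(), r);     // left-anti join: the update wins
      }
    }
    kept.addAll(inserts.values());
    return kept;
  }
}
```

For the p1 -> p2 -> p1 example from the comment, the insert to p1 in Set A is dropped and only the update in Set B survives.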
[GitHub] [hudi] danny0405 commented on a diff in pull request #8107: [HUDI-5514] Adding auto generation of record keys support to Hudi/Spark
danny0405 commented on code in PR #8107:
URL: https://github.com/apache/hudi/pull/8107#discussion_r1156776979

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/ComplexAvroKeyGenerator.java:

@@ -44,6 +48,9 @@ public ComplexAvroKeyGenerator(TypedProperties props) {

 @Override
 public String getRecordKey(GenericRecord record) {
+    if (autoGenerateRecordKeys()) {
+      return StringUtils.EMPTY_STRING;
+    }

Review Comment: We already have `getRecordKey` and `getPartitionPath` as the public API; if you want to fix the `HoodieKey`, shouldn't the `HoodieKey getKey()` be fixed instead?
[GitHub] [hudi] huangxiaopingRD commented on a diff in pull request #8351: [HUDI-6013] Support database name for meta sync in bootstrap
huangxiaopingRD commented on code in PR #8351:
URL: https://github.com/apache/hudi/pull/8351#discussion_r1156774622

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieCLIUtils.scala:

@@ -70,4 +70,16 @@
     throw new SparkException(s"Unsupported identifier $table")
   }
 }

+  def getHoodieDatabaseAndTable(table: String): (String, Option[String]) = {
+    val seq: Seq[String] = table.split('.')

Review Comment: done
[GitHub] [hudi] c-f-cooper opened a new pull request, #8375: [MINOR]Remove the redundancy config
c-f-cooper opened a new pull request, #8375:
URL: https://github.com/apache/hudi/pull/8375

### Change Logs

_Describe context and summary for this change. Highlight if any code was copied._

### Impact

_Describe any public API or user-facing feature change or any performance impact._

### Risk level (write none, low medium or high below)

_If medium or high, explain what verification was done to mitigate the risks._

### Documentation Update

_Describe any necessary documentation update if there is any new feature, config, or user-facing change_

- _The config description must be updated if new configs are added or the default value of the configs are changed_
- _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
[GitHub] [hudi] danny0405 commented on a diff in pull request #8351: [HUDI-6013] Support database name for meta sync in bootstrap
danny0405 commented on code in PR #8351:
URL: https://github.com/apache/hudi/pull/8351#discussion_r1156765591

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieCLIUtils.scala:

@@ -70,4 +70,16 @@
     throw new SparkException(s"Unsupported identifier $table")
   }
 }

+  def getHoodieDatabaseAndTable(table: String): (String, Option[String]) = {
+    val seq: Seq[String] = table.split('.')

Review Comment: `getHoodieDatabaseAndTable` -> `getTableIdentifier`; the returned val should be a string array.
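The reviewer's suggestion amounts to parsing a possibly database-qualified table name into its parts and returning them as an array. A minimal sketch of that idea (method name follows the suggestion in the review; the real signature and validation in Hudi may differ):

```java
// Sketch of the suggested getTableIdentifier helper:
// "db.tbl" -> ["db", "tbl"], "tbl" -> ["tbl"].
// Illustrative only; not the actual HoodieCLIUtils implementation.
class TableIdentifierSketch {
  static String[] getTableIdentifier(String table) {
    String[] parts = table.split("\\."); // '.' is a regex metachar, so escape it
    if (parts.length == 0 || parts.length > 2) {
      throw new IllegalArgumentException("Unsupported identifier " + table);
    }
    return parts;
  }
}

class TableIdDemo {
  public static void main(String[] args) {
    System.out.println(String.join(" / ", TableIdentifierSketch.getTableIdentifier("db.tbl")));
  }
}
```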
[GitHub] [hudi] hudi-bot commented on pull request #8373: [HUDI-5955] fix incremental clean not work cause by archive
hudi-bot commented on PR #8373:
URL: https://github.com/apache/hudi/pull/8373#issuecomment-1495391829

## CI report:

* 5c05dcc35fa86f5ec823efb52cb3fc48416f4846 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16105)
[GitHub] [hudi] bigdata-spec commented on issue #8368: Could not sync using the meta sync class org.apache.hudi.hive.HiveSyncTool[SUPPORT]
bigdata-spec commented on issue #8368: URL: https://github.com/apache/hudi/issues/8368#issuecomment-1495385255

> @bigdata-spec I think at this point you have to use only a supported HMS version. @huangxiaopingRD can comment more.

@ad1happy2go what does "HMS version" mean here? Does it need to fit Hudi or Spark? Does Hudi support HMS version 2.1.1-cdh6.3.2?
[GitHub] [hudi] codope commented on a diff in pull request #8303: [HUDI-5998] Speed up reads from bootstrapped tables in spark
codope commented on code in PR #8303:
URL: https://github.com/apache/hudi/pull/8303#discussion_r1156744308

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:

@@ -180,6 +180,22 @@ case class HoodieFileIndex(spark: SparkSession,
   }
 }

+  /**
+   * In the fast bootstrap read code path, it gets the file status for the bootstrap base files instead of
+   * skeleton files.
+   */
+  private def getBaseFileStatus(baseFiles: mutable.Buffer[HoodieBaseFile]): mutable.Buffer[FileStatus] = {
+    if (shouldFastBootstrap) {
+      return baseFiles.map(f =>
+        if (f.getBootstrapBaseFile.isPresent) {
+          f.getBootstrapBaseFile.get().getFileStatus

Review Comment: Why do we need to guard this by the `shouldFastBootstrap` conditional? Shouldn't we always return the source file status if it's present?
[GitHub] [hudi] nsivabalan commented on a diff in pull request #8344: [HUDI-5968] Fix global index duplicate when update partition
nsivabalan commented on code in PR #8344:
URL: https://github.com/apache/hudi/pull/8344#discussion_r1156745651

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java:

@@ -168,4 +171,36 @@
   }
   return foundRecordKeys;
 }

+  public static <R> HoodieData<HoodieRecord<R>> dedupForPartitionUpdates(
+      HoodieData<Pair<HoodieRecord<R>, Boolean>> taggedHoodieRecords, int dedupParallelism) {
+    /*
+     * In case a record is updated from p1 to p2 and then to p3, 2 existing records
+     * will be tagged for the incoming record to insert to p3. So we dedup them here. (Set A)
+     */
+    HoodiePairData<String, HoodieRecord<R>> deduped = taggedHoodieRecords.filter(Pair::getRight)
+        .map(Pair::getLeft)
+        .distinctWithKey(HoodieRecord::getKey, dedupParallelism)
+        .mapToPair(r -> Pair.of(r.getRecordKey(), r));
+
+    /*
+     * This includes
+     * - tagged existing records whose partition paths are not to be updated (Set B)
+     * - completely new records (Set C)
+     */
+    HoodieData<HoodieRecord<R>> undeduped = taggedHoodieRecords.filter(p -> !p.getRight()).map(Pair::getLeft);
+
+    /*
+     * There can be intersection between Set A and Set B mentioned above.
+     *
+     * Example: record X is updated from p1 to p2 and then back to p1.
+     * Set A will contain an insert to p1 and Set B will contain an update to p1.
+     *
+     * So we let A left-anti join B to drop the insert from Set A and keep the update in Set B.
+     */
+    return deduped.leftOuterJoin(undeduped
+        .filter(r -> !(r.getData() instanceof EmptyHoodieRecordPayload))

Review Comment: does it matter if we favor insert or an update here? If yes, I feel it's better to favor insert and drop the update, so that we maintain the behavior across the board, i.e. whenever a record migrates from one partition to another, we will ignore whatever is in storage and do an insert to the incoming partition. To maintain similar semantics, thinking if we should favor the insert record over the update.
[GitHub] [hudi] codope commented on a diff in pull request #8303: [HUDI-5998] Speed up reads from bootstrapped tables in spark
codope commented on code in PR #8303:
URL: https://github.com/apache/hudi/pull/8303#discussion_r1156740671

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala:

@@ -270,6 +271,21 @@ object DefaultSource {
   }
 }

+  private def resolveHoodieBootstrapRelation(sqlContext: SQLContext,
+                                             globPaths: Seq[Path],
+                                             userSchema: Option[StructType],
+                                             metaClient: HoodieTableMetaClient,
+                                             parameters: Map[String, String]): BaseRelation = {
+    val enableFileIndex = HoodieSparkConfUtils.getConfigValue(parameters, sqlContext.sparkSession.sessionState.conf,
+      ENABLE_HOODIE_FILE_INDEX.key, ENABLE_HOODIE_FILE_INDEX.defaultValue.toString).toBoolean
+    if (!enableFileIndex || globPaths.nonEmpty || parameters.getOrElse(HoodieBootstrapConfig.DATA_QUERIES_ONLY.key(), "true") != "true") {

Review Comment: I think we should do away with the config and rely on the condition here to decide whether or not to use the fast read path (which should be done by default). Wdyt?

## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieSparkSqlWriter.scala:

@@ -807,7 +807,9 @@ class TestHoodieSparkSqlWriter {
     .option("hoodie.insert.shuffle.parallelism", "4")
     .mode(SaveMode.Append).save(tempBasePath)

-  val currentCommits = spark.read.format("hudi").load(tempBasePath).select("_hoodie_commit_time").take(1).map(_.getString(0))
+  val currentCommits = spark.read.format("hudi")
+    .option(HoodieBootstrapConfig.DATA_QUERIES_ONLY.key, "false")

Review Comment: Need more tests. Setting it to `false` does not test the changed code path.

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:

@@ -180,6 +180,22 @@ case class HoodieFileIndex(spark: SparkSession,
   }
 }

+  /**
+   * In the fast bootstrap read code path, it gets the file status for the bootstrap base files instead of
+   * skeleton files.
+   */
+  private def getBaseFileStatus(baseFiles: mutable.Buffer[HoodieBaseFile]): mutable.Buffer[FileStatus] = {
+    if (shouldFastBootstrap) {
+      return baseFiles.map(f =>
+        if (f.getBootstrapBaseFile.isPresent) {
+          f.getBootstrapBaseFile.get().getFileStatus

Review Comment: Why do we need to guard this by the `shouldFastBootstrap` conditional? Shouldn't we always return the source file status if it's present?

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/SparkHoodieTableFileIndex.scala:

@@ -83,10 +83,18 @@ class SparkHoodieTableFileIndex(spark: SparkSession,

 /**
  * Get the schema of the table.
  */
-  lazy val schema: StructType = schemaSpec.getOrElse({
-    val schemaUtil = new TableSchemaResolver(metaClient)
-    AvroConversionUtils.convertAvroSchemaToStructType(schemaUtil.getTableAvroSchema)
-  })
+  lazy val schema: StructType = if (shouldFastBootstrap) {
+    StructType(rawSchema.fields.filterNot(f => HoodieRecord.HOODIE_META_COLUMNS_WITH_OPERATION.contains(f.name)))

Review Comment: just import the static member `HOODIE_META_COLUMNS_WITH_OPERATION` instead of importing the full `HoodieRecord`.
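The point of the `getBaseFileStatus` discussion above is a simple preference rule: in the data-only bootstrap read path, list the bootstrap source (original) file rather than the skeleton file whenever a source file exists. A toy Java sketch of that mapping, with stand-in types (the real code works on `HoodieBaseFile`/`FileStatus` in Scala):

```java
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

// Stand-in for Hudi's HoodieBaseFile: a skeleton file path plus an
// optional bootstrap source file path. Illustrative only.
record BaseFileSketch(String skeletonPath, Optional<String> bootstrapBasePath) {}

class BootstrapListingSketch {
  // Toy model of the review point in HUDI-5998: under the fast bootstrap
  // read path, prefer the bootstrap source file over the skeleton file
  // whenever one is present.
  static List<String> getBaseFilePaths(List<BaseFileSketch> baseFiles, boolean fastBootstrap) {
    return baseFiles.stream()
        .map(f -> (fastBootstrap && f.bootstrapBasePath().isPresent())
            ? f.bootstrapBasePath().get()   // read the source file directly
            : f.skeletonPath())             // otherwise the skeleton/base file
        .collect(Collectors.toList());
  }
}
```

The reviewer's question is whether the `fastBootstrap` flag in this decision is needed at all, since falling back to the skeleton path already handles the absent-source case.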
[GitHub] [hudi] ad1happy2go commented on issue #8368: Could not sync using the meta sync class org.apache.hudi.hive.HiveSyncTool[SUPPORT]
ad1happy2go commented on issue #8368: URL: https://github.com/apache/hudi/issues/8368#issuecomment-1495363471

@bigdata-spec I guess at this point you have to use the supported HMS version only. @huangxiaopingRD can comment more.
[GitHub] [hudi] bigdata-spec commented on issue #8368: Could not sync using the meta sync class org.apache.hudi.hive.HiveSyncTool[SUPPORT]
bigdata-spec commented on issue #8368: URL: https://github.com/apache/hudi/issues/8368#issuecomment-1495349137

> `spark.sql.hive.metastore.version` is not supported in hudi. hudi not compatible with all hive metastore version like Spark.

So, what can I do to deal with this error?
[GitHub] [hudi] hudi-bot commented on pull request #8374: [HUDI-6030] Cleans the ckp meta while the JM restarts
hudi-bot commented on PR #8374:
URL: https://github.com/apache/hudi/pull/8374#issuecomment-1495349020

## CI report:

* 7c8e63752c3f709c3102a5c412c1ec9c40846b90 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16111)
[GitHub] [hudi] hudi-bot commented on pull request #8107: [HUDI-5514] Adding auto generation of record keys support to Hudi/Spark
hudi-bot commented on PR #8107:
URL: https://github.com/apache/hudi/pull/8107#issuecomment-1495348583

## CI report:

* 572189472623065f460bd18436fb3b21602449af Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16101)
* 711df161776bfbe4f66cb04310eb15ccc0069716 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16110)
[GitHub] [hudi] hudi-bot commented on pull request #8344: [HUDI-5968] Fix global index duplicate when update partition
hudi-bot commented on PR #8344:
URL: https://github.com/apache/hudi/pull/8344#issuecomment-1495343022

## CI report:

* 3c004c60160b06b0f4a7a00980c2013cf21af3c3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16104)
* 7624300eb0d7205a4924783606226bbdfd49ad5a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16109)
[GitHub] [hudi] hudi-bot commented on pull request #8374: [HUDI-6030] Cleans the ckp meta while the JM restarts
hudi-bot commented on PR #8374:
URL: https://github.com/apache/hudi/pull/8374#issuecomment-1495343365

## CI report:

* 7c8e63752c3f709c3102a5c412c1ec9c40846b90 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #8107: [HUDI-5514] Adding auto generation of record keys support to Hudi/Spark
hudi-bot commented on PR #8107:
URL: https://github.com/apache/hudi/pull/8107#issuecomment-1495342734

## CI report:

* 572189472623065f460bd18436fb3b21602449af Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16101)
* 711df161776bfbe4f66cb04310eb15ccc0069716 UNKNOWN
[GitHub] [hudi] codope commented on a diff in pull request #8344: [HUDI-5968] Fix global index duplicate when update partition
codope commented on code in PR #8344:
URL: https://github.com/apache/hudi/pull/8344#discussion_r1156725280

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java:

@@ -244,6 +244,12 @@ public class HoodieIndexConfig extends HoodieConfig {
     .defaultValue("true")
     .withDocumentation("Similar to " + BLOOM_INDEX_UPDATE_PARTITION_PATH_ENABLE + ", but for simple index.");

+  public static final ConfigProperty GLOBAL_INDEX_DEDUP_PARALLELISM = ConfigProperty

Review Comment: ok, let's keep it this way. we can revisit later if necessary.
[GitHub] [hudi] bvaradar commented on pull request #5165: [HUDI-3742] Enable parquet enableVectorizedReader for spark inc query to improve peformance
bvaradar commented on PR #5165: URL: https://github.com/apache/hudi/pull/5165#issuecomment-1495341234 @xiarixiaoyao : Can you address the comments in the PR ? @garyli1019 : Any other concern about having vectorization for incr query for MOR (with default turned off ? )
[GitHub] [hudi] hudi-bot commented on pull request #8367: [HUDI-6023] HotFix in HoodieDynamicBoundedBloomFilter with refactor a…
hudi-bot commented on PR #8367: URL: https://github.com/apache/hudi/pull/8367#issuecomment-1495339250 ## CI report: * 38951b92ba068d155efc85b1b38ce860bf3551d4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16091) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16102)
[GitHub] [hudi] hudi-bot commented on pull request #8344: [HUDI-5968] Fix global index duplicate when update partition
hudi-bot commented on PR #8344: URL: https://github.com/apache/hudi/pull/8344#issuecomment-1495339170 ## CI report: * 3c004c60160b06b0f4a7a00980c2013cf21af3c3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16104) * 7624300eb0d7205a4924783606226bbdfd49ad5a UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #8335: [HUDI-6009] Let the jetty server in TimelineService create daemon threads
hudi-bot commented on PR #8335: URL: https://github.com/apache/hudi/pull/8335#issuecomment-1495339120 ## CI report: * f5ffa39e26536c54bcdd7d29b96b8ef242203b3c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16096) * 9bcbb85e4b2bb803e03900b8f01c938833bb1185 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16108)
[GitHub] [hudi] danny0405 commented on issue #8060: [SUPPORT] An instant exception occurs when the flink job is restarted
danny0405 commented on issue #8060: URL: https://github.com/apache/hudi/issues/8060#issuecomment-1495338029 Filed a fix in: https://github.com/apache/hudi/pull/8374
[jira] [Updated] (HUDI-6030) Cleans the ckp meta while the JM restarts
[ https://issues.apache.org/jira/browse/HUDI-6030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6030: - Labels: pull-request-available (was: ) > Cleans the ckp meta while the JM restarts > - > > Key: HUDI-6030 > URL: https://issues.apache.org/jira/browse/HUDI-6030 > Project: Apache Hudi > Issue Type: Improvement > Components: flink >Reporter: Danny Chen >Priority: Major > Labels: pull-request-available > Fix For: 0.13.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] danny0405 opened a new pull request, #8374: [HUDI-6030] Cleans the ckp meta while the JM restarts
danny0405 opened a new pull request, #8374: URL: https://github.com/apache/hudi/pull/8374 ### Change Logs We have received several bug reports since #7620 (for example: https://github.com/apache/hudi/issues/8060). This patch reverts the `CkpMetadata` changes; the write tasks report the write metadata events as before, and the coordinator decides whether to re-commit these metadata stats. ### Impact Fixes the problem introduced by #7620. ### Risk level (write none, low medium or high below) none ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[jira] [Created] (HUDI-6030) Cleans the ckp meta while the JM restarts
Danny Chen created HUDI-6030: Summary: Cleans the ckp meta while the JM restarts Key: HUDI-6030 URL: https://issues.apache.org/jira/browse/HUDI-6030 Project: Apache Hudi Issue Type: Improvement Components: flink Reporter: Danny Chen Fix For: 0.13.1
[GitHub] [hudi] bvaradar commented on pull request #7748: [WIP][HUDI-5560] Make Consistent hash index Bucket Resizing more available…
bvaradar commented on PR #7748: URL: https://github.com/apache/hudi/pull/7748#issuecomment-1495325226 @fengjian428 : Is this RFC ready for review ?
[GitHub] [hudi] bvaradar commented on pull request #7962: [HUDI-5801] Speed metaTable initializeFileGroups
bvaradar commented on PR #7962: URL: https://github.com/apache/hudi/pull/7962#issuecomment-1495324171 @loukey-lj : Have you seen slowness in metatable initialization in practice before? For cases like the PARTITION_NAME_FILES metadata, the number of file-groups is 1. Running under the engine context would result in more overhead for such a case. cc @nsivabalan
[GitHub] [hudi] hudi-bot commented on pull request #8335: [HUDI-6009] Let the jetty server in TimelineService create daemon threads
hudi-bot commented on PR #8335: URL: https://github.com/apache/hudi/pull/8335#issuecomment-1495315767 ## CI report: * f5ffa39e26536c54bcdd7d29b96b8ef242203b3c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16096) * 9bcbb85e4b2bb803e03900b8f01c938833bb1185 UNKNOWN
[jira] [Created] (HUDI-6029) Rollback may omit invalid files when commitMetadata is not completed for MOR
lei w created HUDI-6029: --- Summary: Rollback may omit invalid files when commitMetadata is not completed for MOR Key: HUDI-6029 URL: https://issues.apache.org/jira/browse/HUDI-6029 Project: Apache Hudi Issue Type: Bug Reporter: lei w

Currently, using listingBasedRollbackStrategy may omit invalid files when the commitMetadata is not completed. The problem arises because the strategy compares the instantToRollback timestamp against the baseCommitTime of each log file to judge whether the log files are valid:
{code:java}
// commit is the instant time which should be rolled back;
// in most cases the baseCommitTime may not equal commit
(path) -> {
  if (path.toString().endsWith(basefileExtension)) {
    String fileCommitTime = FSUtils.getCommitTime(path.getName());
    return commit.equals(fileCommitTime);
  } else if (FSUtils.isLogFile(path)) {
    // Since the baseCommitTime is the only commit for new log files, it's okay here
    String fileCommitTime = FSUtils.getBaseCommitTimeFromLogPath(path);
    return commit.equals(fileCommitTime);
  }
  return false;
};
{code}
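The report above can be illustrated with a minimal, self-contained sketch. Note this uses simplified file names and a hypothetical `baseCommitTime` parser, not Hudi's actual `FSUtils` helpers: a log file appended to a file slice whose base commit predates the rolled-back instant is never matched by the filter, even though it may contain blocks written by that instant.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class RollbackFilterSketch {
    // Hypothetical parser for simplified names like ".f1_20230401.log.1";
    // Hudi's real log-file naming and FSUtils helpers are more involved.
    public static String baseCommitTime(String logFileName) {
        return logFileName.substring(logFileName.indexOf('_') + 1, logFileName.indexOf(".log"));
    }

    // Mirrors the predicate quoted in the report: keep only log files whose
    // base commit time equals the instant being rolled back.
    public static List<String> matchedForRollback(String commitToRollback, List<String> logFiles) {
        return logFiles.stream()
                .filter(f -> commitToRollback.equals(baseCommitTime(f)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Two log files: the second was written during commit 20230404,
        // but appended to a file slice whose base commit is 20230401.
        List<String> logFiles = Arrays.asList(".f1_20230404.log.1", ".f2_20230401.log.2");
        List<String> matched = matchedForRollback("20230404", logFiles);
        // Only the first file matches; the 20230401 log file is omitted even
        // though it may hold blocks written by the rolled-back commit.
        System.out.println(matched);
    }
}
```

This is why the ticket argues that, without completed commitMetadata, matching purely on timestamps can silently skip files that should be rolled back.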
[GitHub] [hudi] bvaradar commented on pull request #6705: [HUDI-4868] Fixed the issue that compaction is invalid when the last commit action is replace commit.
bvaradar commented on PR #6705: URL: https://github.com/apache/hudi/pull/6705#issuecomment-1495309319 @watermelon12138 : Pinging to see if you are interested in updating this PR ?
[GitHub] [hudi] hudi-bot commented on pull request #8344: [HUDI-5968] Fix global index duplicate when update partition
hudi-bot commented on PR #8344: URL: https://github.com/apache/hudi/pull/8344#issuecomment-1495307023 ## CI report: * 3c004c60160b06b0f4a7a00980c2013cf21af3c3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16104)
[GitHub] [hudi] bvaradar commented on pull request #7956: [HUDI-5797] fix use bulk insert error as row
bvaradar commented on PR #7956: URL: https://github.com/apache/hudi/pull/7956#issuecomment-1495306934 @KnightChess : I am not sure I understand why this is only a problem with bulkInsert as row. Is the problem that, when doing MDT init, files which are not committed (empty/partial) are being added (see HoodieBackedTableMetadataWriter.listAllPartitions)? @prashantwason : Can you let me know if I am missing something.
[GitHub] [hudi] hudi-bot commented on pull request #8107: [HUDI-5514] Adding auto generation of record keys support to Hudi/Spark
hudi-bot commented on PR #8107: URL: https://github.com/apache/hudi/pull/8107#issuecomment-1495306511 ## CI report: * 572189472623065f460bd18436fb3b21602449af Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16101)
[GitHub] [hudi] hudi-bot commented on pull request #7173: [HUDI-5189] Make HiveAvroSerializer compatible with hive3
hudi-bot commented on PR #7173: URL: https://github.com/apache/hudi/pull/7173#issuecomment-1495305780 ## CI report: * 363aad76c3a145bdd38aa83488efdaa6d5ac1d82 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16012) * 2ff867c31714270d57518a0c7ca30c7ee98ce612 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16107)
[GitHub] [hudi] rmahindra123 commented on a diff in pull request #8328: [HUDI-6002] Add JsonSchemaKafkaSource to handle json schema payload
rmahindra123 commented on code in PR #8328: URL: https://github.com/apache/hudi/pull/8328#discussion_r1156689995 ## hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JsonSchemaKafkaSource.java: ## @@ -0,0 +1,127 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.utilities.sources; + +import org.apache.hudi.common.config.TypedProperties; +import org.apache.hudi.common.util.StringUtils; +import org.apache.hudi.exception.HoodieException; +import org.apache.hudi.utilities.UtilHelpers; +import org.apache.hudi.utilities.exception.HoodieSourcePostProcessException; +import org.apache.hudi.utilities.ingestion.HoodieIngestionMetrics; +import org.apache.hudi.utilities.schema.SchemaProvider; +import org.apache.hudi.utilities.sources.helpers.KafkaOffsetGen; +import org.apache.hudi.utilities.sources.processor.JsonKafkaSourcePostProcessor; + +import com.fasterxml.jackson.core.JsonProcessingException; +import com.fasterxml.jackson.databind.ObjectMapper; +import com.fasterxml.jackson.databind.node.ObjectNode; +import org.apache.kafka.clients.consumer.ConsumerRecord; +import org.apache.log4j.LogManager; +import org.apache.log4j.Logger; +import org.apache.spark.api.java.JavaRDD; +import org.apache.spark.api.java.JavaSparkContext; +import org.apache.spark.sql.SparkSession; +import org.apache.spark.streaming.kafka010.KafkaUtils; +import org.apache.spark.streaming.kafka010.LocationStrategies; +import org.apache.spark.streaming.kafka010.OffsetRange; + +import java.io.IOException; +import java.util.LinkedHashMap; +import java.util.LinkedList; +import java.util.List; + +import static org.apache.hudi.utilities.schema.KafkaOffsetPostProcessor.KAFKA_SOURCE_OFFSET_COLUMN; +import static org.apache.hudi.utilities.schema.KafkaOffsetPostProcessor.KAFKA_SOURCE_PARTITION_COLUMN; +import static org.apache.hudi.utilities.schema.KafkaOffsetPostProcessor.KAFKA_SOURCE_TIMESTAMP_COLUMN; + +public class JsonSchemaKafkaSource extends JsonKafkaSource { Review Comment: +1 looks like a lot of repetitive code -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
[GitHub] [hudi] hudi-bot commented on pull request #8102: [HUDI-5880] Support partition pruning for flink streaming source in runtime
hudi-bot commented on PR #8102: URL: https://github.com/apache/hudi/pull/8102#issuecomment-1495279686 ## CI report: * be88d99070504f75c88bfcf48b3c078ca93a35df Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16097) * a66c8ec83a1a8e75d1e28c3e7444b7c3306049a6 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16106)
[GitHub] [hudi] bvaradar commented on pull request #7834: [HUDI-5690] Add simpleBucketPartitioner to support using the simple bucket index under bulkinsert
bvaradar commented on PR #7834: URL: https://github.com/apache/hudi/pull/7834#issuecomment-1495279011 @wuwenchi : Can you look at the PR comments and address them when you get a chance.
[GitHub] [hudi] hudi-bot commented on pull request #7173: [HUDI-5189] Make HiveAvroSerializer compatible with hive3
hudi-bot commented on PR #7173: URL: https://github.com/apache/hudi/pull/7173#issuecomment-1495278342 ## CI report: * 363aad76c3a145bdd38aa83488efdaa6d5ac1d82 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16012) * 2ff867c31714270d57518a0c7ca30c7ee98ce612 UNKNOWN
[GitHub] [hudi] codope commented on pull request #7942: [HUDI-5753] Add docs for record payload
codope commented on PR #7942: URL: https://github.com/apache/hudi/pull/7942#issuecomment-1495277107 > @codope is it possible you can provide an example to extend the payload for a customized option. Also, are there configs the user should consider that's provided out-of-the-box? If possible, can you specify all of them inline with the right class? @nfarah86 I have added a link to FAQ where there are more details on how to implement a custom payload. I have also removed the record merger API. Need to follow up with a separate doc or update this doc in a separate PR.
[GitHub] [hudi] hudi-bot commented on pull request #8373: [HUDI-5955] fix incremental clean not work cause by archive
hudi-bot commented on PR #8373: URL: https://github.com/apache/hudi/pull/8373#issuecomment-1495274194 ## CI report: * 5c05dcc35fa86f5ec823efb52cb3fc48416f4846 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16105)
[GitHub] [hudi] hudi-bot commented on pull request #8102: [HUDI-5880] Support partition pruning for flink streaming source in runtime
hudi-bot commented on PR #8102: URL: https://github.com/apache/hudi/pull/8102#issuecomment-1495273281 ## CI report: * be88d99070504f75c88bfcf48b3c078ca93a35df Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16097) * a66c8ec83a1a8e75d1e28c3e7444b7c3306049a6 UNKNOWN
[GitHub] [hudi] codope merged pull request #7985: [DOCS] Update clustering docs
codope merged PR #7985: URL: https://github.com/apache/hudi/pull/7985
[hudi] branch asf-site updated: [DOCS] Update clustering docs (#7985)
This is an automated email from the ASF dual-hosted git repository. codope pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/asf-site by this push: new 76b212fed0a [DOCS] Update clustering docs (#7985) 76b212fed0a is described below commit 76b212fed0a766fe0a2edd4c04215bb52e718343 Author: Sagar Sumit AuthorDate: Tue Apr 4 08:22:31 2023 +0530 [DOCS] Update clustering docs (#7985) --- website/docs/clustering.md | 231 ++--- .../assets/images/clustering_small_files.gif | Bin 0 -> 668806 bytes website/static/assets/images/clustering_sort.gif | Bin 0 -> 735437 bytes 3 files changed, 159 insertions(+), 72 deletions(-) diff --git a/website/docs/clustering.md b/website/docs/clustering.md index 9e157de785b..d2ceb196d02 100644 --- a/website/docs/clustering.md +++ b/website/docs/clustering.md @@ -10,6 +10,17 @@ last_modified_at: Apache Hudi brings stream processing to big data, providing fresh data while being an order of magnitude efficient over traditional batch processing. In a data lake/warehouse, one of the key trade-offs is between ingestion speed and query performance. Data ingestion typically prefers small files to improve parallelism and make data available to queries as soon as possible. However, query performance degrades poorly with a lot of small files. Also, during ingestion, data is typically co-l [...] +## How is compaction different from clustering? + +Hudi is modeled like a log-structured storage engine with multiple versions of the data. +Particularly, [Merge-On-Read](/docs/table_types#merge-on-read-table) +tables in Hudi store data using a combination of base file in columnar format and row-based delta logs that contain +updates. Compaction is a way to merge the delta logs with base files to produce the latest file slices with the most +recent snapshot of data. 
Compaction helps to keep the query performance in check (larger delta log files would incur +longer merge times on query side). On the other hand, clustering is a data layout optimization technique. One can stitch +together small files into larger files using clustering. Additionally, data can be clustered by sort key so that queries +can take advantage of data locality. + ## Clustering Architecture At a high level, Hudi provides different operations such as insert/upsert/bulk_insert through it’s write client API to be able to write data to a Hudi table. To be able to choose a trade-off between file size and ingestion speed, Hudi provides a knob `hoodie.parquet.small.file.limit` to be able to configure the smallest allowable file size. Users are able to configure the small file [soft limit](https://hudi.apache.org/docs/configurations/#hoodieparquetsmallfilelimit) to `0` to force new [...] @@ -22,13 +33,13 @@ Clustering table service can run asynchronously or synchronously adding a new ac -### Overall, there are 2 parts to clustering +### Overall, there are 2 steps to clustering 1. Scheduling clustering: Create a clustering plan using a pluggable clustering strategy. 2. Execute clustering: Process the plan using an execution strategy to create new files and replace old files. -### Scheduling clustering +### Schedule clustering Following steps are followed to schedule clustering. @@ -37,7 +48,7 @@ Following steps are followed to schedule clustering. 3. Finally, the clustering plan is saved to the timeline in an avro [metadata format](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieClusteringPlan.avsc). -### Running clustering +### Execute clustering 1. Read the clustering plan and get the ‘clusteringGroups’ that mark the file groups that need to be clustered. 2. For each group, we instantiate appropriate strategy class with strategyParams (example: sortColumns) and apply that strategy to rewrite the data. 
@@ -51,8 +62,147 @@ NOTE: Clustering can only be scheduled for tables / partitions not receiving any ![Clustering example](/assets/images/blog/clustering/example_perf_improvement.png) _Figure: Illustrating query performance improvements by clustering_ -### Setting up clustering -Inline clustering can be setup easily using spark dataframe options. See sample below +## Clustering Usecases + +### Batching small files + +As mentioned in the intro, streaming ingestion generally results in smaller files in your data lake. But having a lot of +such small files could lead to higher query latency. From our experience supporting community users, there are quite a +few users who are using Hudi just for small file handling capabilities. So, you could employ clustering to batch a lot +of such small files into larger ones. + +![Batching small files](/assets/images/clustering_small_files.gif) + +### Cluster by sort key + +Another classic problem in data lake is the arrival time vs event time prob
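The clustering docs in the commit above mention setting up inline clustering via Spark dataframe options. A minimal hedged sketch of such an options map follows; the keys are standard Hudi clustering configs, while the values and sort columns are illustrative placeholders, not recommendations:

```java
import java.util.HashMap;
import java.util.Map;

public class InlineClusteringOptions {
    // Builds a map of typical inline-clustering write options that could be
    // passed to a Spark datasource write on a Hudi table.
    public static Map<String, String> clusteringOptions() {
        Map<String, String> opts = new HashMap<>();
        // Turn on inline clustering: schedule and execute as part of the write.
        opts.put("hoodie.clustering.inline", "true");
        // Trigger clustering every 4 commits (illustrative cadence).
        opts.put("hoodie.clustering.inline.max.commits", "4");
        // Target ~1 GB output files, treating files under ~600 MB as "small".
        opts.put("hoodie.clustering.plan.strategy.target.file.max.bytes", "1073741824");
        opts.put("hoodie.clustering.plan.strategy.small.file.limit", "629145600");
        // Sort rewritten data for locality (column names are placeholders).
        opts.put("hoodie.clustering.plan.strategy.sort.columns", "column1,column2");
        return opts;
    }

    public static void main(String[] args) {
        clusteringOptions().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Each entry would typically be applied with `.option(key, value)` on the dataframe writer; consult the Hudi clustering configuration reference for defaults and additional strategy options.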
[GitHub] [hudi] hudi-bot commented on pull request #8373: [HUDI-5955] fix incremental clean not work cause by archive
hudi-bot commented on PR #8373: URL: https://github.com/apache/hudi/pull/8373#issuecomment-1495266885 ## CI report: * 5c05dcc35fa86f5ec823efb52cb3fc48416f4846 UNKNOWN
[GitHub] [hudi] bvaradar commented on a diff in pull request #6799: [HUDI-4920] fix PartialUpdatePayload cannot return deleted record in …
bvaradar commented on code in PR #6799: URL: https://github.com/apache/hudi/pull/6799#discussion_r1156663626

## hudi-common/src/test/java/org/apache/hudi/common/model/TestPartialUpdateAvroPayload.java:

@@ -155,8 +155,8 @@ public void testDeletedRecord() throws IOException {
 PartialUpdateAvroPayload payload1 = new PartialUpdateAvroPayload(record1, 0L);
 PartialUpdateAvroPayload payload2 = new PartialUpdateAvroPayload(delRecord1, 1L);
-assertArrayEquals(payload1.preCombine(payload2).recordBytes, payload2.recordBytes);
-assertArrayEquals(payload2.preCombine(payload1).recordBytes, payload2.recordBytes);
+assertArrayEquals(payload1.preCombine(payload2, schema, new Properties()).recordBytes, payload2.recordBytes);
+assertArrayEquals(payload2.preCombine(payload1, schema, new Properties()).recordBytes, payload2.recordBytes);

Review Comment: Can you add an explicit test-case for the deleted record case here during precombine? The test-case needs to check for the `_hoodie_is_deleted` flag in the returned record.

## hudi-common/src/main/java/org/apache/hudi/common/model/PartialUpdateAvroPayload.java:

@@ -89,6 +89,8 @@
 */
 public class PartialUpdateAvroPayload extends OverwriteNonDefaultsWithLatestAvroPayload {
+ private boolean isPreCombining = false;

Review Comment: This member variable needs to be removed as it is no longer used.
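The semantics the requested test should pin down can be sketched independently of the Avro plumbing: the payload with the higher ordering value wins precombine, and when the winner is a delete, the surviving record must still carry the delete marker. The sketch below uses plain maps in place of Avro records and is not the actual Hudi payload API:

```java
import java.util.Map;

// Minimal model of precombine-with-delete: pick the payload with the higher
// ordering value (last-writer-wins); a delete "wins" only if it is the later
// write. The test then inspects _hoodie_is_deleted on the surviving record.
public class PreCombineSketch {
    static Map<String, Object> preCombine(Map<String, Object> a, long orderA,
                                          Map<String, Object> b, long orderB) {
        // On a tie, keep the first argument, mirroring "keep current" behavior.
        return orderA >= orderB ? a : b;
    }

    static boolean isDeleted(Map<String, Object> record) {
        return Boolean.TRUE.equals(record.get("_hoodie_is_deleted"));
    }
}
```

A test in this spirit would assert that combining an upsert (ordering 0) with a delete (ordering 1) yields a record whose delete flag is set, in both argument orders.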
[GitHub] [hudi] huangxiaopingRD commented on issue #8368: Could not sync using the meta sync class org.apache.hudi.hive.HiveSyncTool[SUPPORT]
huangxiaopingRD commented on issue #8368: URL: https://github.com/apache/hudi/issues/8368#issuecomment-1495252052 `spark.sql.hive.metastore.version` is not supported in Hudi. Hudi is not compatible with every Hive metastore version the way Spark is.
[GitHub] [hudi] hbgstc123 commented on pull request #8232: [HUDI-5955] fix incremental clean not work caused by archive
hbgstc123 commented on PR #8232: URL: https://github.com/apache/hudi/pull/8232#issuecomment-1495239401 https://github.com/apache/hudi/pull/8373 I submitted a new PR that falls back to a full clean if an instant needed for incremental clean is archived.
[GitHub] [hudi] hbgstc123 opened a new pull request, #8373: [HUDI-5955] fix incremental clean not work cause by archive
hbgstc123 opened a new pull request, #8373: URL: https://github.com/apache/hudi/pull/8373

### Change Logs
The incremental timeline may miss some partitions if the instant after the "earliest retained instant" of the last completed clean plan is archived, so fall back to a full clean if the earliest instant to retain is before the start of the active timeline.

### Impact
no

### Risk level (write none, low medium or high below)
low

### Documentation Update
no

### Contributor's checklist
- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
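The fallback rule described in the change log can be sketched as a small predicate. Hudi instants are timestamp strings that order lexicographically, so "the needed instant was archived" reduces to a string comparison; names here are illustrative, not the actual Hudi API:

```java
// Sketch of the clean-mode decision: if the instant we would need for an
// incremental clean sorts before the first instant still on the active
// timeline, it has been archived, so plan a full clean instead.
public class CleanPlanSketch {
    // Hypothetical helper, not the actual Hudi planner API.
    static boolean useFullClean(String earliestInstantToRetain, String firstActiveInstant) {
        // No prior clean recorded: nothing to be incremental against.
        if (earliestInstantToRetain == null) {
            return true;
        }
        return earliestInstantToRetain.compareTo(firstActiveInstant) < 0;
    }
}
```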
[GitHub] [hudi] LiJie20190102 commented on issue #8331: [SUPPORT] When using the HoodieDeltaStreamer, is there a corresponding parameter that can control the number of cycles? For example, if I cycle
LiJie20190102 commented on issue #8331: URL: https://github.com/apache/hudi/issues/8331#issuecomment-1495234468 @ad1happy2go Should we stop SparkContext?
[GitHub] [hudi] hudi-bot commented on pull request #8344: [HUDI-5968] Fix global index duplicate when update partition
hudi-bot commented on PR #8344: URL: https://github.com/apache/hudi/pull/8344#issuecomment-1495232941

## CI report:
* fa1b1525a163af85271f0dc9e0d5765ea2075044 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16058)
* 3c004c60160b06b0f4a7a00980c2013cf21af3c3 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16104)
[GitHub] [hudi] hudi-bot commented on pull request #8344: [HUDI-5968] Fix global index duplicate when update partition
hudi-bot commented on PR #8344: URL: https://github.com/apache/hudi/pull/8344#issuecomment-1495227355

## CI report:
* fa1b1525a163af85271f0dc9e0d5765ea2075044 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16058)
* 3c004c60160b06b0f4a7a00980c2013cf21af3c3 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #8107: [HUDI-5514] Adding auto generation of record keys support to Hudi/Spark
hudi-bot commented on PR #8107: URL: https://github.com/apache/hudi/pull/8107#issuecomment-1495226882

## CI report:
* d3e3d9ffd1bf60dabfb36d37133493683ea56a4c Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16100)
* 572189472623065f460bd18436fb3b21602449af Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16101)
[GitHub] [hudi] hudi-bot commented on pull request #8367: [HUDI-6023] HotFix in HoodieDynamicBoundedBloomFilter with refactor a…
hudi-bot commented on PR #8367: URL: https://github.com/apache/hudi/pull/8367#issuecomment-1495222704

## CI report:
* 38951b92ba068d155efc85b1b38ce860bf3551d4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16091) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16102)
[GitHub] [hudi] hudi-bot commented on pull request #7881: [HUDI-5723] Automate and standardize enum configs
hudi-bot commented on PR #7881: URL: https://github.com/apache/hudi/pull/7881#issuecomment-1495222123

## CI report:
* c378a74c177a2f1a924609a44f0978ee347d272a UNKNOWN
* 6fd0ec68de1fc063cc3e79bea173e9f073d4517e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16099)
[GitHub] [hudi] hudi-bot commented on pull request #8107: [HUDI-5514] Adding auto generation of record keys support to Hudi/Spark
hudi-bot commented on PR #8107: URL: https://github.com/apache/hudi/pull/8107#issuecomment-1495222373

## CI report:
* 09d9feab5048d47a149f4088c23af9b5072250fa Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16077)
* d3e3d9ffd1bf60dabfb36d37133493683ea56a4c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16100)
* 572189472623065f460bd18436fb3b21602449af UNKNOWN
[jira] [Closed] (HUDI-5983) Improve loading data via cloud store incr source
[ https://issues.apache.org/jira/browse/HUDI-5983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu closed HUDI-5983. Fix Version/s: 0.14.0 Assignee: Raymond Xu Resolution: Fixed > Improve loading data via cloud store incr source > - > > Key: HUDI-5983 > URL: https://issues.apache.org/jira/browse/HUDI-5983 > Project: Apache Hudi > Issue Type: Improvement > Components: deltastreamer >Reporter: Raymond Xu >Assignee: Raymond Xu >Priority: Major > Labels: pull-request-available > Fix For: 0.13.1, 0.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[hudi] branch master updated (627b608e3eb -> 5d5658347ad)
This is an automated email from the ASF dual-hosted git repository. xushiyan pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

from 627b608e3eb [MINOR] Optimize code style (#8357)
add 5d5658347ad [HUDI-5983] Improve loading data via cloud store incr source (#8290)

No new revisions were added by this update.

Summary of changes:
 .../sources/GcsEventsHoodieIncrSource.java         | 36
 .../sources/S3EventsHoodieIncrSource.java          | 86 +++---
 .../sources/helpers/CloudObjectMetadata.java       | 27 --
 .../helpers/CloudObjectsSelectorCommon.java        | 88 +++---
 ...eDataFetcher.java => GcsObjectDataFetcher.java} | 14 +--
 ...sFetcher.java => GcsObjectMetadataFetcher.java} | 42 -
 .../utilities/sources/helpers/gcs/QueryInfo.java   |  2 +-
 .../sources/TestGcsEventsHoodieIncrSource.java     | 101 -
 8 files changed, 189 insertions(+), 207 deletions(-)
 copy hudi-common/src/main/java/org/apache/hudi/common/function/SerializablePairFlatMapFunction.java => hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectMetadata.java (69%)
 rename hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/gcs/{FileDataFetcher.java => GcsObjectDataFetcher.java} (72%)
 rename hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/gcs/{FilePathsFetcher.java => GcsObjectMetadataFetcher.java} (67%)
[GitHub] [hudi] xushiyan merged pull request #8290: [HUDI-5983] Improve loading data via cloud store incr source
xushiyan merged PR #8290: URL: https://github.com/apache/hudi/pull/8290
[GitHub] [hudi] xushiyan commented on pull request #8290: [HUDI-5983] Improve loading data via cloud store incr source
xushiyan commented on PR #8290: URL: https://github.com/apache/hudi/pull/8290#issuecomment-1495220009 ![Screenshot 2023-04-03 at 8 41 39 PM](https://user-images.githubusercontent.com/2701446/229664578-243eaafc-a52f-4e05-b0b1-1f2f4af07e08.png) CI passed
[GitHub] [hudi] xuzifu666 commented on pull request #8367: [HUDI-6023] HotFix in HoodieDynamicBoundedBloomFilter with refactor a…
xuzifu666 commented on PR #8367: URL: https://github.com/apache/hudi/pull/8367#issuecomment-1495219415 @hudi-bot run azure
[GitHub] [hudi] LiJie20190102 commented on issue #8331: [SUPPORT] When using the HoodieDeltaStreamer, is there a corresponding parameter that can control the number of cycles? For example, if I cycle
LiJie20190102 commented on issue #8331: URL: https://github.com/apache/hudi/issues/8331#issuecomment-1495213134

> @LiJie20190102 Can you let us know the complete spark-submit command you are using.

I found a configuration: `--post-write-termination-strategy-class`. I tried using `org.apache.hudi.utilities.deltastreamer.NoNewDataTerminationStrategy` to stop the task, but it didn't seem to meet my expectations. I expected that after it stops the ExecutorService, the SparkContext would also stop, but the SparkContext stays up and no subsequent logs are visible. ![image](https://user-images.githubusercontent.com/53458004/229662805-e1b4bfa2-31f6-4ad1-aede-860ecb6af143.png) ![image](https://user-images.githubusercontent.com/53458004/229662822-ce078c25-467d-44fa-b286-db5d3d2e8d07.png)
[GitHub] [hudi] bigdata-spec commented on issue #8368: Could not sync using the meta sync class org.apache.hudi.hive.HiveSyncTool[SUPPORT]
bigdata-spec commented on issue #8368: URL: https://github.com/apache/hudi/issues/8368#issuecomment-1495208401 @huangxiaopingRD @ad1happy2go Thank you for your kindness. The HMS version is 2.1.1-cdh6.3.2. Our environment is CDH 6.3.2 and we want to replace **2.4.0-cdh6.3.2 for Spark** with **Apache Spark 3.1.1**, so I used the command: `./dev/make-distribution.sh --name 3.0.0-cdh6.3.2 --tgz -Pyarn -Phive-thriftserver -Dhadoop.version=3.0.0-cdh6.3.2` and got **spark-3.1.1-bin-3.0.0-cdh6.3.2.tgz**. In spark-defaults.conf I set
```
spark.sql.hive.metastore.version=2.1.1
spark.sql.hive.metastore.jars=/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/hive/lib/*
```
It works well for common Hive tables, but for Hudi tables, create succeeds while insert fails.
[GitHub] [hudi] rahil-c commented on pull request #5391: [HUDI-3945] After the async compaction operation is complete, the task should exit
rahil-c commented on PR #5391: URL: https://github.com/apache/hudi/pull/5391#issuecomment-1495174849 @watermelon12138 Which Spark version were you using when you encountered the issue that prompted you to create this PR?
[GitHub] [hudi] bithw1 closed issue #8370: [SUPPORT]What's the difference between time-travel-query and point-in-time-query in the doc.
bithw1 closed issue #8370: [SUPPORT]What's the difference between time-travel-query and point-in-time-query in the doc. URL: https://github.com/apache/hudi/issues/8370
[GitHub] [hudi] bithw1 commented on issue #8370: [SUPPORT]What's the difference between time-travel-query and point-in-time-query in the doc.
bithw1 commented on issue #8370: URL: https://github.com/apache/hudi/issues/8370#issuecomment-1495169282 Thanks @ad1happy2go, we have the same understanding. Thanks.
[GitHub] [hudi] rahil-c commented on pull request #5391: [HUDI-3945] After the async compaction operation is complete, the task should exit
rahil-c commented on PR #5391: URL: https://github.com/apache/hudi/pull/5391#issuecomment-1495127618 @yihua @xiarixiaoyao Wanted to get community thoughts on whether this is safe to revert. I also tried the steps mentioned in the JIRA (https://issues.apache.org/jira/browse/HUDI-3945) to see if this `sys.exit` is required, but in my own repro, things work fine without the sys exit call, similar to what @TengHuo mentioned. The concern with this `sys.exit` call can be seen in the Spark code here: https://github.com/apache/spark/blob/v3.3.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L258
```
// If user application is exited ahead of time by calling System.exit(N), here mark
// this application as failed with EXIT_EARLY. For a good shutdown, user shouldn't call
// System.exit(0) to terminate the application.
```
This is where the `ApplicationMaster: Final app status: FAILED, exitCode: 16, (reason: Shutdown hook called before final status was reported.)` message comes from.
[GitHub] [hudi] hudi-bot commented on pull request #8102: [HUDI-5880] Support partition pruning for flink streaming source in runtime
hudi-bot commented on PR #8102: URL: https://github.com/apache/hudi/pull/8102#issuecomment-1495101369

## CI report:
* be88d99070504f75c88bfcf48b3c078ca93a35df Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16097)
[GitHub] [hudi] hudi-bot commented on pull request #8335: [HUDI-6009] Let the jetty server in TimelineService create daemon threads
hudi-bot commented on PR #8335: URL: https://github.com/apache/hudi/pull/8335#issuecomment-1495086954

## CI report:
* f5ffa39e26536c54bcdd7d29b96b8ef242203b3c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16096)
[GitHub] [hudi] hudi-bot commented on pull request #8231: [HUDI-5963] Release 0.13.1 prep
hudi-bot commented on PR #8231: URL: https://github.com/apache/hudi/pull/8231#issuecomment-1495035017

## CI report:
* 1041e445959cf9148ab904b3d456884e0ead7f9e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16095)
[GitHub] [hudi] hudi-bot commented on pull request #8326: [HUDI-6006] Deprecate hoodie.payload.ordering.field
hudi-bot commented on PR #8326: URL: https://github.com/apache/hudi/pull/8326#issuecomment-1495026283

## CI report:
* 4b0c681e00e9ac437a7ff039a0cb827fd5420470 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16094)
[GitHub] [hudi] hudi-bot commented on pull request #8107: [HUDI-5514] Adding auto generation of record keys support to Hudi/Spark
hudi-bot commented on PR #8107: URL: https://github.com/apache/hudi/pull/8107#issuecomment-1494975080

## CI report:
* 09d9feab5048d47a149f4088c23af9b5072250fa Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16077)
* d3e3d9ffd1bf60dabfb36d37133493683ea56a4c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16100)
[GitHub] [hudi] peter-mccabe commented on issue #8144: [SUPPORT]Unable to connect to an s3 hudi table
peter-mccabe commented on issue #8144: URL: https://github.com/apache/hudi/issues/8144#issuecomment-1494974166 Any update on this? I really need a way to manage this.
[GitHub] [hudi] hudi-bot commented on pull request #8107: [HUDI-5514] Adding auto generation of record keys support to Hudi/Spark
hudi-bot commented on PR #8107: URL: https://github.com/apache/hudi/pull/8107#issuecomment-1494967724

## CI report:
* 09d9feab5048d47a149f4088c23af9b5072250fa Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16077)
* d3e3d9ffd1bf60dabfb36d37133493683ea56a4c UNKNOWN
[GitHub] [hudi] xushiyan commented on a diff in pull request #8344: [HUDI-5968] Fix global index duplicate when update partition
xushiyan commented on code in PR #8344: URL: https://github.com/apache/hudi/pull/8344#discussion_r1156432021 ## hudi-common/src/test/java/org/apache/hudi/common/testutils/HoodieSimpleDataGenerator.java: ## @@ -0,0 +1,68 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.hudi.common.testutils; + +import org.apache.hudi.common.model.DefaultHoodieRecordPayload; +import org.apache.hudi.common.model.HoodieAvroRecord; +import org.apache.hudi.common.model.HoodieKey; +import org.apache.hudi.common.model.HoodieRecord; + +import org.apache.avro.Schema; +import org.apache.avro.generic.GenericData; +import org.apache.avro.generic.GenericRecord; + +import java.util.List; +import java.util.stream.Collectors; +import java.util.stream.IntStream; + +public class HoodieSimpleDataGenerator { Review Comment: `HoodieTestDataGenerator` actually needs an overhaul as the APIs became unorganized over the years and hard to use. More importantly, randomness is a big cause to flakiness and we need a deterministic data gen more than a random data gen for UT/FT scenarios. I can revert this back to using existing data gen class and let the future overhaul work cover the new class adoption. 
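The determinism point raised in the review above is easy to illustrate: if each record is a pure function of its index (no `Random` involved), two runs of a test generate identical data, removing one common source of flaky UTs. This is only a sketch of the idea, not the `HoodieTestDataGenerator` API:

```java
import java.util.ArrayList;
import java.util.List;

// Deterministic generation: record key i depends only on i, so every run of
// a test sees exactly the same data, byte for byte.
public class DeterministicDataGen {
    static List<String> genRecordKeys(String prefix, int count) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            keys.add(prefix + "-" + i);
        }
        return keys;
    }
}
```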
[GitHub] [hudi] nsivabalan commented on a diff in pull request #8107: [HUDI-5514] Adding auto generation of record keys support to Hudi/Spark
nsivabalan commented on code in PR #8107: URL: https://github.com/apache/hudi/pull/8107#discussion_r1156418648

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/ComplexAvroKeyGenerator.java:

@@ -44,6 +48,9 @@ public ComplexAvroKeyGenerator(TypedProperties props) {
 @Override
 public String getRecordKey(GenericRecord record) {
+if (autoGenerateRecordKeys()) {
+ return StringUtils.EMPTY_STRING;
+}

Review Comment: This is kind of unavoidable in the current structure. For e.g., even to fetch the partition path, our KeyGenerator interface only exposes
```
HoodieKey getKey(GenericRecord record)
```
So, to fetch the partition path, we have to call getKey(genRec).getPartitionPath, and hence I had to return an empty string here. We don't want to add a new API to the interface just for this purpose. In case of auto key gen flows, we generate the record keys explicitly (not via the key gen class) and add them to the HoodieKey that we materialize in memory for all records. I can sync up w/ you f2f to clarify this. Ideally, we need two different interfaces: one to generate the partition path and one to generate the record key, and then some of these workarounds may not be required. But with the current structure, we use a single key gen class to generate both record keys and partition paths.
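The two-interface split described at the end of the comment could look roughly like the following; the names and shapes here are hypothetical, not the current Hudi KeyGenerator API:

```java
import java.util.Map;

// Hypothetical split of key generation into one contract per responsibility,
// so an auto-key-gen flow can resolve partition paths without faking a
// record key. Plain maps stand in for Avro records.
interface RecordKeyGenerator {
    String getRecordKey(Map<String, String> record);
}

interface PartitionPathGenerator {
    String getPartitionPath(Map<String, String> record);
}

// A field-based generator implements both; an auto key gen flow would
// register only the PartitionPathGenerator side.
class FieldBasedGenerator implements RecordKeyGenerator, PartitionPathGenerator {
    @Override
    public String getRecordKey(Map<String, String> record) {
        return record.get("uuid");
    }

    @Override
    public String getPartitionPath(Map<String, String> record) {
        return record.get("partition");
    }
}
```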
[GitHub] [hudi] xushiyan commented on a diff in pull request #8344: [HUDI-5968] Fix global index duplicate when update partition
xushiyan commented on code in PR #8344: URL: https://github.com/apache/hudi/pull/8344#discussion_r1156417851

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java:

@@ -244,6 +244,12 @@ public class HoodieIndexConfig extends HoodieConfig {
 .defaultValue("true")
 .withDocumentation("Similar to " + BLOOM_INDEX_UPDATE_PARTITION_PATH_ENABLE + ", but for simple index.");
+
+ public static final ConfigProperty GLOBAL_INDEX_DEDUP_PARALLELISM = ConfigProperty

Review Comment: Not very clear at the moment, given this is still tunable depending on the data's update ratio. It may stay as an infrequently used one like `hoodie.markers.delete.parallelism`.
[GitHub] [hudi] nsivabalan commented on a diff in pull request #8107: [HUDI-5514] Adding auto generation of record keys support to Hudi/Spark
nsivabalan commented on code in PR #8107: URL: https://github.com/apache/hudi/pull/8107#discussion_r1156413723

## hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala:

@@ -82,9 +86,19 @@
 val keyGenerator = ReflectionUtils.loadClass(keyGeneratorClassName, new TypedProperties(config.getProps))
 .asInstanceOf[SparkKeyGeneratorInterface]
+ val partitionId = TaskContext.getPartitionId()
+ var rowId = 0
 iter.map { row =>
-val recordKey = keyGenerator.getRecordKey(row, schema)
+// auto generate record keys if needed
+val recordKey = if (autoGenerateRecordKeys) {
+ val recKey = HoodieRecord.generateSequenceId(instantTime, partitionId, rowId)
+ rowId += 1
+ UTF8String.fromString(recKey)
+}
+else { // else use key generator to fetch record key
+ keyGenerator.getRecordKey(row, schema)

Review Comment: For normal ingestion, we don't use an empty string. I will respond to your question elsewhere (where we return the empty string); it's not very apparent.
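The auto-generated keys shown in the diff above (`generateSequenceId(instantTime, partitionId, rowId)`) amount to a counter scoped to a (commit, Spark partition) pair, which guarantees uniqueness without involving the key generator. A self-contained sketch of that scheme follows; the exact delimiter and format Hudi uses are an assumption here:

```java
// Uniqueness argument: instantTime is unique per commit, partitionId is
// unique per task within a commit, and rowId increments within a task,
// so the triple never repeats for a given table.
public class AutoKeySketch {
    // Illustrative stand-in for HoodieRecord.generateSequenceId; the real
    // format may differ.
    static String generateKey(String instantTime, int partitionId, long rowId) {
        return instantTime + "_" + partitionId + "_" + rowId;
    }
}
```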
[GitHub] [hudi] yihua commented on a diff in pull request #7881: [HUDI-5723] Automate and standardize enum configs
yihua commented on code in PR #7881: URL: https://github.com/apache/hudi/pull/7881#discussion_r1156364101

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java:

@@ -147,17 +129,16 @@ public class HoodieCleanConfig extends HoodieConfig {
   public static final ConfigProperty FAILED_WRITES_CLEANER_POLICY = ConfigProperty
       .key("hoodie.cleaner.policy.failed.writes")
       .defaultValue(HoodieFailedWritesCleaningPolicy.EAGER.name())
+      .withEnumDocumentation(HoodieFailedWritesCleaningPolicy.class,
+          "note that LAZY policy is required when multi-writers are enabled.")

Review Comment: nit: capitalize the first letter.

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieBootstrapConfig.java:

@@ -83,8 +79,8 @@ public class HoodieBootstrapConfig extends HoodieConfig {
   public static final ConfigProperty KEYGEN_TYPE = ConfigProperty
       .key("hoodie.bootstrap.keygen.type")
       .defaultValue(KeyGeneratorType.SIMPLE.name())
-      .sinceVersion("0.9.0")
-      .withDocumentation("Type of build-in key generator, currently support SIMPLE, COMPLEX, TIMESTAMP, CUSTOM, NON_PARTITION, GLOBAL_DELETE");
+      .withEnumDocumentation(KeyGeneratorType.class, "Key generator class for bootstrap")

Review Comment: For the second argument, is the convention to add a period (`.`) at the end or not? I see both in different enum configs.

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:

@@ -3028,11 +2993,11 @@ private void validate() {
     Objects.requireNonNull(writeConfig.getString(BASE_PATH));
     if (writeConfig.isEarlyConflictDetectionEnable()) {
       checkArgument(writeConfig.getString(WRITE_CONCURRENCY_MODE)
-          .equalsIgnoreCase(WriteConcurrencyMode.OPTIMISTIC_CONCURRENCY_CONTROL.value()),
+          .equals(WriteConcurrencyMode.OPTIMISTIC_CONCURRENCY_CONTROL.name()),

Review Comment: Same here, could we ignore case as before?

## hudi-common/src/main/java/org/apache/hudi/common/config/ConfigProperty.java:

@@ -139,6 +144,52 @@ public ConfigProperty withDocumentation(String doc) {
     return new ConfigProperty<>(key, defaultValue, docOnDefaultValue, doc, sinceVersion, deprecatedVersion, inferFunction, validValues, advanced, alternatives);
   }

+  public <T extends Enum<T>> ConfigProperty withEnumDocumentation(Class<T> e) {
+    return withEnumDocumentation(e, "");
+  }
+
+  private <T extends Enum<T>> boolean isDefaultField(Class<T> e, Field f) {
+    if (!hasDefaultValue()) {
+      return false;
+    }
+    if (defaultValue() instanceof String) {
+      return f.getName().equals(defaultValue());
+    }
+    return Enum.valueOf(e, f.getName()).equals(defaultValue());
+  }
+
+  public <T extends Enum<T>> ConfigProperty withEnumDocumentation(Class<T> e, String doc, String... internalOption) {

Review Comment: Could we rename this as `withDocumentation` and remove `doc` and `internalOption` for simplicity? `doc` content can be merged to `@EnumDescription`. We can mark internal options in the docs.

## hudi-common/src/main/java/org/apache/hudi/common/util/queue/DisruptorWaitStrategyType.java:

@@ -27,35 +30,50 @@
 /**
  * Enum for the type of waiting strategy in Disruptor Queue.
  */
+@EnumDescription("Type of waiting strategy in the Disruptor Queue")

Review Comment: We can keep the docs the same as before for now. Any docs improvement can be in a separate PR.

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieBootstrapConfig.java:

@@ -55,12 +52,10 @@ public class HoodieBootstrapConfig extends HoodieConfig {
   public static final ConfigProperty PARTITION_SELECTOR_REGEX_MODE = ConfigProperty
       .key("hoodie.bootstrap.mode.selector.regex.mode")
-      .defaultValue(METADATA_ONLY.name())
-      .sinceVersion("0.6.0")
-      .withValidValues(METADATA_ONLY.name(), FULL_RECORD.name())
-      .withDocumentation("Bootstrap mode to apply for partition paths, that match regex above. "
-          + "METADATA_ONLY will generate just skeleton base files with keys/footers, avoiding full cost of rewriting the dataset. "
-          + "FULL_RECORD will perform a full copy/rewrite of the data as a Hudi table.");

Review Comment: @jonvex I think @lokeshj1703 means that `avoiding full cost of rewriting the dataset` is missing in the new docs to indicate the benefit.

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieBootstrapConfig.java:

@@ -83,8 +79,8 @@ public class HoodieBootstrapConfig extends HoodieConfig {
   public static final ConfigProperty KEYGEN_TYPE = ConfigProperty
       .key("hoodie.bootstrap.keygen.type")
       .defaultValue(KeyGeneratorType.SIMPLE.name())
-      .sinceVersion("0.9.0")
-      .withDocumentation("Type of build-in key generator, currently support SIMPLE, COMPLEX, TIMESTAMP, CUSTOM, NON_PARTITION, GLOBAL_DELETE");
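The `withEnumDocumentation` helper being reviewed above generates a config description by reflecting over an enum's constants instead of hand-maintaining a value list in the doc string. A self-contained sketch of that idea (the annotation name mirrors Hudi's `@EnumDescription`, but this version and the `docFor` helper are illustrative, not Hudi's actual API):

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.util.StringJoiner;

public class EnumDocSketch {
    // Illustrative stand-in for Hudi's @EnumDescription annotation.
    @Retention(RetentionPolicy.RUNTIME)
    @interface Description { String value(); }

    @Description("Type of waiting strategy in the queue")
    enum WaitStrategy { BLOCKING, SLEEPING, YIELDING }

    // Builds a doc string from the enum's class-level description plus its
    // constants, marking the configured default, roughly as an enum-aware
    // withDocumentation might.
    static <T extends Enum<T>> String docFor(Class<T> e, T defaultValue) {
        Description d = e.getAnnotation(Description.class);
        StringJoiner names = new StringJoiner(", ");
        for (T constant : e.getEnumConstants()) {
            names.add(constant == defaultValue
                ? constant.name() + " (default)"
                : constant.name());
        }
        return (d == null ? "" : d.value() + ". ") + "Allowed values: " + names;
    }

    public static void main(String[] args) {
        System.out.println(docFor(WaitStrategy.class, WaitStrategy.BLOCKING));
    }
}
```

The benefit, which motivates the PR, is that adding an enum constant automatically updates every config doc that references the enum, so the description can never drift out of sync with the allowed values.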
[GitHub] [hudi] hudi-bot commented on pull request #8369: [HUDI-6024] Hotfix in MergeIntoHoodieTableCommand::validate with remo…
hudi-bot commented on PR #8369: URL: https://github.com/apache/hudi/pull/8369#issuecomment-1494885734 ## CI report: * 544fc9fba0dbf84c03353dcdaf52b7409d31af40 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16092) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #8128: [HUDI-5782] Tweak defaults and remove unnecessary configs after config review
hudi-bot commented on PR #8128: URL: https://github.com/apache/hudi/pull/8128#issuecomment-1494885024 ## CI report: * fca6d63c9ef24cdd0cfe30060a58430d035e0664 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16093) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] nfarah86 commented on issue #8365: [SUPPORT] inconsistent Readoptimized view in merge on read table
nfarah86 commented on issue #8365: URL: https://github.com/apache/hudi/issues/8365#issuecomment-1494876289 It's not documented. I'm working on updating documentation.
[jira] [Updated] (HUDI-6028) GCS incr source does not handle pubsub message properly
[ https://issues.apache.org/jira/browse/HUDI-6028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-6028: Sprint: Sprint 2023-04-10

> GCS incr source does not handle pubsub message properly
> ---
>
> Key: HUDI-6028
> URL: https://issues.apache.org/jira/browse/HUDI-6028
> Project: Apache Hudi
> Issue Type: Bug
> Components: deltastreamer
> Reporter: Raymond Xu
> Priority: Major
>
> The GCS event source uses the schema converter from Spark and won't handle a field name with a hyphen in a nested column. A sample message:
> {code:java}
> 23/04/03 19:23:45 DEBUG GcsEventsSource: msg: {
>   "kind": "storage#object",
>   "id": "",
>   "selfLink": "",
>   "name": "",
>   "bucket": "",
>   "generation": "1680505551370137",
>   "metageneration": "1",
>   "contentType": "application/octet-stream",
>   "timeCreated": "2023-04-03T07:05:51.373Z",
>   "updated": "2023-04-03T07:05:51.373Z",
>   "storageClass": "STANDARD",
>   "timeStorageClassUpdated": "2023-04-03T07:05:51.373Z",
>   "size": "6707",
>   "md5Hash": "",
>   "mediaLink": "",
>   "metadata": {
>     "goog-reserved-file-mtime": "1680503048"
>   },
>   "crc32c": "",
>   "etag": ""
> }
> {code}
> and it throws
> {code}
> Exception in thread "main" org.apache.avro.SchemaParseException: Illegal character in: goog-reserved-file-mtime
>   at org.apache.avro.Schema.validateName(Schema.java:1571)
>   at org.apache.avro.Schema.access$400(Schema.java:92)
>   at org.apache.avro.Schema$Field.<init>(Schema.java:549)
>   at org.apache.avro.SchemaBuilder$FieldBuilder.completeField(SchemaBuilder.java:2258)
>   at org.apache.avro.SchemaBuilder$FieldBuilder.completeField(SchemaBuilder.java:2254)
>   at org.apache.avro.SchemaBuilder$FieldBuilder.access$5100(SchemaBuilder.java:2150)
>   at org.apache.avro.SchemaBuilder$GenericDefault.noDefault(SchemaBuilder.java:2557)
>   at org.apache.hudi.org.apache.spark.sql.avro.SchemaConverters$.$anonfun$toAvroType$2(SchemaConverters.scala:205)
> {code}
> This is a problem with org.apache.spark.sql.avro.SchemaConverters#toAvroType

-- This message was sent by Atlassian Jira (v8.20.10#820010)
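Avro requires names to match `[A-Za-z_][A-Za-z0-9_]*`, which is exactly why the hyphenated `goog-reserved-file-mtime` key fails `Schema.validateName`. One common workaround (an illustrative sketch, not what Hudi or Spark currently does) is to sanitize field names before handing the schema to the converter:

```java
import java.util.regex.Pattern;

public class AvroNameSanitizer {
    // Avro names must start with [A-Za-z_] and contain only [A-Za-z0-9_].
    private static final Pattern INVALID_CHARS = Pattern.compile("[^A-Za-z0-9_]");

    // Replaces every illegal character with an underscore, and prefixes an
    // underscore when the result would otherwise start with a digit.
    static String sanitize(String fieldName) {
        String cleaned = INVALID_CHARS.matcher(fieldName).replaceAll("_");
        if (!cleaned.isEmpty() && Character.isDigit(cleaned.charAt(0))) {
            cleaned = "_" + cleaned;
        }
        return cleaned;
    }

    public static void main(String[] args) {
        System.out.println(sanitize("goog-reserved-file-mtime")); // goog_reserved_file_mtime
    }
}
```

Note that sanitization is lossy: a real fix would also need to map sanitized names back to the original JSON keys when extracting values, and distinct inputs like `a-b` and `a_b` would collide.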
[GitHub] [hudi] xccui closed issue #8305: [SUPPORT] Potential FileSystem http connection leaking
xccui closed issue #8305: [SUPPORT] Potential FileSystem http connection leaking URL: https://github.com/apache/hudi/issues/8305
[GitHub] [hudi] xccui commented on issue #8305: [SUPPORT] Potential FileSystem http connection leaking
xccui commented on issue #8305: URL: https://github.com/apache/hudi/issues/8305#issuecomment-1494861495

Hi @danny0405, I looked into this again. You are right, `returnContent()` will release the connection. Actually, I was misled by the code. There will be two `PoolingHttpClientConnectionManager`s at runtime.

```
leaseConnection:306, PoolingHttpClientConnectionManager (com.amazonaws.thirdparty.apache.http.impl.conn)
get:282, PoolingHttpClientConnectionManager$1 (com.amazonaws.thirdparty.apache.http.impl.conn)
invoke:-1, GeneratedMethodAccessor24 (jdk.internal.reflect)
invoke:43, DelegatingMethodAccessorImpl (jdk.internal.reflect)
invoke:566, Method (java.lang.reflect)
invoke:70, ClientConnectionRequestFactory$Handler (com.amazonaws.http.conn)
get:-1, $Proxy51 (com.amazonaws.http.conn)
execute:190, MainClientExec (com.amazonaws.thirdparty.apache.http.impl.execchain)
execute:186, ProtocolExec (com.amazonaws.thirdparty.apache.http.impl.execchain)
doExecute:185, InternalHttpClient (com.amazonaws.thirdparty.apache.http.impl.client)
execute:83, CloseableHttpClient (com.amazonaws.thirdparty.apache.http.impl.client)
execute:56, CloseableHttpClient (com.amazonaws.thirdparty.apache.http.impl.client)
execute:72, SdkHttpClient (com.amazonaws.http.apache.client.impl)
executeOneRequest:1346, AmazonHttpClient$RequestExecutor (com.amazonaws.http)
executeHelper:1157, AmazonHttpClient$RequestExecutor (com.amazonaws.http)
doExecute:814, AmazonHttpClient$RequestExecutor (com.amazonaws.http)
executeWithTimer:781, AmazonHttpClient$RequestExecutor (com.amazonaws.http)
execute:755, AmazonHttpClient$RequestExecutor (com.amazonaws.http)
access$500:715, AmazonHttpClient$RequestExecutor (com.amazonaws.http)
execute:697, AmazonHttpClient$RequestExecutionBuilderImpl (com.amazonaws.http)
execute:561, AmazonHttpClient (com.amazonaws.http)
execute:541, AmazonHttpClient (com.amazonaws.http)
invoke:5456, AmazonS3Client (com.amazonaws.services.s3)
invoke:5403, AmazonS3Client (com.amazonaws.services.s3)
getObjectMetadata:1372, AmazonS3Client (com.amazonaws.services.s3)
lambda$getObjectMetadata$10:2545, S3AFileSystem (org.apache.hadoop.fs.s3a)
apply:-1, 497983073 (org.apache.hadoop.fs.s3a.S3AFileSystem$$Lambda$1189)
retryUntranslated:414, Invoker (org.apache.hadoop.fs.s3a)
retryUntranslated:377, Invoker (org.apache.hadoop.fs.s3a)
getObjectMetadata:2533, S3AFileSystem (org.apache.hadoop.fs.s3a)
getObjectMetadata:2513, S3AFileSystem (org.apache.hadoop.fs.s3a)
s3GetFileStatus:3776, S3AFileSystem (org.apache.hadoop.fs.s3a)
innerGetFileStatus:3688, S3AFileSystem (org.apache.hadoop.fs.s3a)
lambda$getFileStatus$24:3556, S3AFileSystem (org.apache.hadoop.fs.s3a)
apply:-1, 718057245 (org.apache.hadoop.fs.s3a.S3AFileSystem$$Lambda$2610)
lambda$trackDurationOfOperation$5:499, IOStatisticsBinding (org.apache.hadoop.fs.statistics.impl)
apply:-1, 2039613101 (org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding$$Lambda$1168)
trackDuration:444, IOStatisticsBinding (org.apache.hadoop.fs.statistics.impl)
trackDurationAndSpan:2337, S3AFileSystem (org.apache.hadoop.fs.s3a)
trackDurationAndSpan:2356, S3AFileSystem (org.apache.hadoop.fs.s3a)
getFileStatus:3554, S3AFileSystem (org.apache.hadoop.fs.s3a)
lambda$getFileStatus$17:410, HoodieWrapperFileSystem (org.apache.hudi.common.fs)
get:-1, 589863653 (org.apache.hudi.common.fs.HoodieWrapperFileSystem$$Lambda$2609)
executeFuncWithTimeMetrics:114, HoodieWrapperFileSystem (org.apache.hudi.common.fs)
getFileStatus:404, HoodieWrapperFileSystem (org.apache.hudi.common.fs)
checkTableValidity:51, TableNotFoundException (org.apache.hudi.exception)
<init>:137, HoodieTableMetaClient (org.apache.hudi.common.table)
newMetaClient:689, HoodieTableMetaClient (org.apache.hudi.common.table)
access$000:81, HoodieTableMetaClient (org.apache.hudi.common.table)
build:770, HoodieTableMetaClient$Builder (org.apache.hudi.common.table)
createMetaClient:277, StreamerUtil (org.apache.hudi.util)
<init>:118, WriteProfile (org.apache.hudi.sink.partitioner.profile)
<init>:44, DeltaWriteProfile (org.apache.hudi.sink.partitioner.profile)
getWriteProfile:75, WriteProfiles (org.apache.hudi.sink.partitioner.profile)
lambda$singleton$0:64, WriteProfiles (org.apache.hudi.sink.partitioner.profile)
apply:-1, 401283836 (org.apache.hudi.sink.partitioner.profile.WriteProfiles$$Lambda$3189)
computeIfAbsent:1134, HashMap (java.util)
singleton:63, WriteProfiles (org.apache.hudi.sink.partitioner)
create:56, BucketAssigners (org.apache.hudi.sink.partitioner)
open:122, BucketAssignFunction (org.apache.hudi.sink.partitioner)
openFunction:34, FunctionUtils (org.apache.flink.api.common.functions.util)
open:100, AbstractUdfStreamOperator (org.apache.flink.streaming.api.operators)
open:55, KeyedProcessOperator (org.apache.flink.streaming.api.operators)
initia
```
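The pool mechanics behind this discussion can be shown with a toy model (purely illustrative, not Apache HttpClient's API): a pooled client leases a connection per request and returns it only when the response is fully consumed, so responses that are never consumed eventually exhaust the pool. That is why `returnContent()`, which consumes the entity and releases the connection, matters.

```java
import java.util.concurrent.Semaphore;

public class PooledClientSketch {
    // A response that returns its connection to the pool only when consumed,
    // mirroring how returnContent() releases a pooled HTTP connection.
    static class Response {
        private final Semaphore pool;
        private boolean consumed = false;
        Response(Semaphore pool) { this.pool = pool; }
        void consume() { if (!consumed) { consumed = true; pool.release(); } }
    }

    // A client backed by a fixed-size connection pool; the semaphore's permits
    // stand in for the connection manager's available leases.
    static class PooledClient {
        final Semaphore pool;
        PooledClient(int poolSize) { this.pool = new Semaphore(poolSize); }
        Response execute() {
            if (!pool.tryAcquire()) {
                throw new IllegalStateException("connection pool exhausted");
            }
            return new Response(pool);
        }
        int availableConnections() { return pool.availablePermits(); }
    }

    public static void main(String[] args) {
        PooledClient client = new PooledClient(2);
        Response leaked = client.execute(); // never consumed -> this lease is lost
        Response ok = client.execute();
        System.out.println(client.availableConnections()); // 0: both leased
        ok.consume();                                      // release on consume
        System.out.println(client.availableConnections()); // 1: the leak still holds one
    }
}
```

With a real pool, each leaked lease permanently removes capacity until the route's connection limit is hit, at which point further requests block or time out, which matches the leak symptom reported in the issue.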
[jira] [Created] (HUDI-6028) GCS incr source does not handle pubsub message properly
Raymond Xu created HUDI-6028:
Summary: GCS incr source does not handle pubsub message properly
Key: HUDI-6028
URL: https://issues.apache.org/jira/browse/HUDI-6028
Project: Apache Hudi
Issue Type: Bug
Components: deltastreamer
Reporter: Raymond Xu

The GCS event source uses the schema converter from Spark and won't handle a field name with a hyphen in a nested column. A sample message:
{code:java}
23/04/03 19:23:45 DEBUG GcsEventsSource: msg: {
  "kind": "storage#object",
  "id": "",
  "selfLink": "",
  "name": "",
  "bucket": "",
  "generation": "1680505551370137",
  "metageneration": "1",
  "contentType": "application/octet-stream",
  "timeCreated": "2023-04-03T07:05:51.373Z",
  "updated": "2023-04-03T07:05:51.373Z",
  "storageClass": "STANDARD",
  "timeStorageClassUpdated": "2023-04-03T07:05:51.373Z",
  "size": "6707",
  "md5Hash": "",
  "mediaLink": "",
  "metadata": {
    "goog-reserved-file-mtime": "1680503048"
  },
  "crc32c": "",
  "etag": ""
}
{code}
and it throws
{code}
Exception in thread "main" org.apache.avro.SchemaParseException: Illegal character in: goog-reserved-file-mtime
  at org.apache.avro.Schema.validateName(Schema.java:1571)
  at org.apache.avro.Schema.access$400(Schema.java:92)
  at org.apache.avro.Schema$Field.<init>(Schema.java:549)
  at org.apache.avro.SchemaBuilder$FieldBuilder.completeField(SchemaBuilder.java:2258)
  at org.apache.avro.SchemaBuilder$FieldBuilder.completeField(SchemaBuilder.java:2254)
  at org.apache.avro.SchemaBuilder$FieldBuilder.access$5100(SchemaBuilder.java:2150)
  at org.apache.avro.SchemaBuilder$GenericDefault.noDefault(SchemaBuilder.java:2557)
  at org.apache.hudi.org.apache.spark.sql.avro.SchemaConverters$.$anonfun$toAvroType$2(SchemaConverters.scala:205)
{code}
This is a problem with org.apache.spark.sql.avro.SchemaConverters#toAvroType
[GitHub] [hudi] xccui commented on issue #8060: [SUPPORT] An instant exception occurs when the flink job is restarted
xccui commented on issue #8060: URL: https://github.com/apache/hudi/issues/8060#issuecomment-1494850427 I hit the same issue. Just feel that the current asynchronous operations are a bit fragile. I believe sometimes tasks in a Flink job will be in a zombie state before they get killed. In that case, Hudi will see multiple writers. If we know that could happen, is it possible to avoid it?