Re: [PR] [HUDI-6821] Support multiple base file formats in Hudi table [hudi]

2023-10-12 Thread via GitHub


hudi-bot commented on PR #9761:
URL: https://github.com/apache/hudi/pull/9761#issuecomment-1760911414

   
   ## CI report:
   
   * be1d48f0e855ddf6deae23b1beea6734d01d7ae7 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20172)
 
   * 60875bdf9343f5f48a862393ff5845063cf549d7 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]

2023-10-12 Thread via GitHub


hudi-bot commented on PR #9819:
URL: https://github.com/apache/hudi/pull/9819#issuecomment-1760911695

   
   ## CI report:
   
   * a4985db0ce22fb4b4f2518ed70bd96890024a08b UNKNOWN
   * e3b8ef2e7f6c0b4155b7a4a5596aa2df558e81a9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20314)
 
   * fcc9de9a960da6510ad8b14aae79deab0331c963 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20320)
 
   * ae5b99aeee4353aca027b6d64deb5e8af1fafe83 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6940] OutputStream maybe not close with try block in HoodieHeartbeatClient [hudi]

2023-10-12 Thread via GitHub


danny0405 commented on code in PR #9855:
URL: https://github.com/apache/hudi/pull/9855#discussion_r1357767211


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/heartbeat/HoodieHeartbeatClient.java:
##
@@ -250,11 +250,8 @@ public boolean isHeartbeatExpired(String instantTime) throws IOException {
   }
 
   private void updateHeartbeat(String instantTime) throws HoodieHeartbeatException {
-try {
+try (OutputStream outputStream = this.fs.create(new Path(heartbeatFolderPath + Path.SEPARATOR + instantTime), true)) {
   Long newHeartbeatTime = System.currentTimeMillis();
-  OutputStream outputStream =
-  this.fs.create(new Path(heartbeatFolderPath + Path.SEPARATOR + instantTime), true);
-  outputStream.close();

Review Comment:
   Hasn't it already been closed on this line?
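
   For context, the fix replaces an explicit create-then-close sequence with try-with-resources, which closes the stream even when the body throws. A minimal, self-contained sketch of the pattern using plain java.nio rather than Hudi's FileSystem API:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class TryWithResourcesDemo {
  public static void main(String[] args) throws IOException {
    Path heartbeatFile = Files.createTempFile("heartbeat", ".tmp");
    // The stream is closed automatically when the block exits,
    // whether normally or via an exception.
    try (OutputStream out = Files.newOutputStream(heartbeatFile)) {
      out.write(Long.toString(System.currentTimeMillis()).getBytes());
    }
  }
}
```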






Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]

2023-10-12 Thread via GitHub


lokesh-lingarajan-0310 commented on code in PR #9743:
URL: https://github.com/apache/hudi/pull/9743#discussion_r1357725006


##
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieCommonConfig.java:
##
@@ -71,6 +71,14 @@ public class HoodieCommonConfig extends HoodieConfig {
   + " operation will fail schema compatibility check. Set this option 
to true will make the newly added "
   + " column nullable to successfully complete the write operation.");
 
+  public static final ConfigProperty ADD_NULL_FOR_DELETED_COLUMNS = 
ConfigProperty

Review Comment:
   If we make this the default behavior, pipelines that are designed to fail on dropped columns will start ingesting data with the new code. Let's make an explicit callout in the docs/release notes if we want to make this the default behavior.
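
   (The ConfigProperty definition is truncated in the quoted diff. For orientation only, Hudi configs follow a builder pattern along these lines; the key name, default, and wording below are illustrative assumptions, not the PR's actual values.)

```java
import org.apache.hudi.common.config.ConfigProperty;

// Hypothetical sketch of the truncated definition; key, default,
// and documentation text are assumed for illustration.
public static final ConfigProperty<Boolean> ADD_NULL_FOR_DELETED_COLUMNS = ConfigProperty
    .key("hoodie.write.add.null.for.deleted.columns")   // assumed key name
    .defaultValue(false)                                // assumed default
    .withDocumentation("When an incoming batch no longer contains a column that exists"
        + " in the table schema, fill that column with null instead of failing the write.");
```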






Re: [PR] [HUDI-6940] OutputStream maybe not close with try block in HoodieHeartbeatClient [hudi]

2023-10-12 Thread via GitHub


hudi-bot commented on PR #9855:
URL: https://github.com/apache/hudi/pull/9855#issuecomment-1760714040

   
   ## CI report:
   
   * 1ea7baca7d8632d1848e5653881a4a3cd7d715e7 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20321)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [I] [SUPPORT] Is `hoodie.datasource.hive_sync.filter_pushdown_enabled` can enable default? [hudi]

2023-10-12 Thread via GitHub


KnightChess commented on issue #9784:
URL: https://github.com/apache/hudi/issues/9784#issuecomment-1760712493

   @boneanxs hi, could you give a suggestion?





Re: [PR] [HUDI-6940] OutputStream maybe not close with try block in HoodieHeartbeatClient [hudi]

2023-10-12 Thread via GitHub


hudi-bot commented on PR #9855:
URL: https://github.com/apache/hudi/pull/9855#issuecomment-1760709149

   
   ## CI report:
   
   * 1ea7baca7d8632d1848e5653881a4a3cd7d715e7 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]

2023-10-12 Thread via GitHub


hudi-bot commented on PR #9819:
URL: https://github.com/apache/hudi/pull/9819#issuecomment-1760704098

   
   ## CI report:
   
   * a4985db0ce22fb4b4f2518ed70bd96890024a08b UNKNOWN
   * e3b8ef2e7f6c0b4155b7a4a5596aa2df558e81a9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20314)
 
   * fcc9de9a960da6510ad8b14aae79deab0331c963 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20320)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]

2023-10-12 Thread via GitHub


lokesh-lingarajan-0310 commented on code in PR #9743:
URL: https://github.com/apache/hudi/pull/9743#discussion_r1357713769


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala:
##
@@ -578,17 +582,25 @@ object HoodieSparkSqlWriter {
   } else {
 if (!shouldValidateSchemasCompatibility) {
   // if no validation is enabled, check for col drop
-  // if col drop is allowed, go ahead. if not, check for projection, so that we do not allow dropping cols
-  if (allowAutoEvolutionColumnDrop || canProject(latestTableSchema, canonicalizedSourceSchema)) {
+  if (allowAutoEvolutionColumnDrop) {
 canonicalizedSourceSchema
   } else {
-log.error(
-  s"""Incoming batch schema is not compatible with the table's one.
-   |Incoming schema ${sourceSchema.toString(true)}
-   |Incoming schema (canonicalized) ${canonicalizedSourceSchema.toString(true)}
-   |Table's schema ${latestTableSchema.toString(true)}
-   |""".stripMargin)
-throw new SchemaCompatibilityException("Incoming batch schema is not compatible with the table's one")
+val reconciledSchema = if (addNullForDeletedColumns) {
+  AvroSchemaEvolutionUtils.reconcileSchema(canonicalizedSourceSchema, latestTableSchema)

Review Comment:
   Since we are using a single AvroSchemaEvolutionUtils.reconcileSchema API for both soft (OOB) and hard evolution (schema.on.read), should we just remove the code for "DataSourceWriteOptions.RECONCILE_SCHEMA.key()" and make this function less cluttered?






[jira] [Updated] (HUDI-6940) OutputStream maybe not close in HoodieHeartbeatClient

2023-10-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-6940:
-
Labels: pull-request-available  (was: )

> OutputStream maybe not close in HoodieHeartbeatClient
> -
>
> Key: HUDI-6940
> URL: https://issues.apache.org/jira/browse/HUDI-6940
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: flink
>Reporter: kwang
>Priority: Major
>  Labels: pull-request-available
>






[PR] [HUDI-6940] OutputStream maybe not close with try block in HoodieHeartbeatClient [hudi]

2023-10-12 Thread via GitHub


ksmou opened a new pull request, #9855:
URL: https://github.com/apache/hudi/pull/9855

   ### Change Logs
   
   The OutputStream in HoodieHeartbeatClient may not be closed if an exception occurs; wrap its creation in a try-with-resources block so it is always closed.
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[jira] [Created] (HUDI-6940) OutputStream maybe not close in HoodieHeartbeatClient

2023-10-12 Thread kwang (Jira)
kwang created HUDI-6940:
---

 Summary: OutputStream maybe not close in HoodieHeartbeatClient
 Key: HUDI-6940
 URL: https://issues.apache.org/jira/browse/HUDI-6940
 Project: Apache Hudi
  Issue Type: Improvement
  Components: flink
Reporter: kwang








Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]

2023-10-12 Thread via GitHub


danny0405 commented on code in PR #9743:
URL: https://github.com/apache/hudi/pull/9743#discussion_r1357686036


##
hudi-common/src/main/java/org/apache/hudi/avro/AvroSchemaCompatibility.java:
##
@@ -372,7 +372,8 @@ private SchemaCompatibilityResult calculateCompatibility(final Schema reader, fi
 return (writer.getType() == Type.STRING) ? result : result.mergedWith(typeMismatch(reader, writer, locations));
   }
   case STRING: {
-return (writer.getType() == Type.BYTES) ? result : result.mergedWith(typeMismatch(reader, writer, locations));
+return ((writer.getType() == Type.BYTES) || (writer.getType() == Type.INT) || (writer.getType() == Type.LONG

Review Comment:
   Maybe we can have a utility to decide whether the type is a numeric data 
type.
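
   A sketch of the kind of helper being suggested (the name and the exact set of types are assumptions; the PR does not define this method):

```java
import org.apache.avro.Schema;

// Hypothetical utility, not part of the PR.
public final class SchemaTypeUtils {
  private SchemaTypeUtils() {
  }

  /** Returns true if the given avro type is a numeric primitive. */
  public static boolean isNumericType(Schema.Type type) {
    switch (type) {
      case INT:
      case LONG:
      case FLOAT:
      case DOUBLE:
        return true;
      default:
        return false;
    }
  }
}
```

   With such a helper, the STRING case above could read something like `writer.getType() == Type.BYTES || isNumericType(writer.getType())`.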



##
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieCommonConfig.java:
##
@@ -71,6 +71,14 @@ public class HoodieCommonConfig extends HoodieConfig {
   + " operation will fail schema compatibility check. Set this option 
to true will make the newly added "
   + " column nullable to successfully complete the write operation.");
 
+  public static final ConfigProperty ADD_NULL_FOR_DELETED_COLUMNS = 
ConfigProperty

Review Comment:
   +1 for this to be the default behavior, or just fail the write operation.



##
hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java:
##
@@ -1131,6 +1196,25 @@ private static Schema getActualSchemaFromUnion(Schema 
schema, Object data) {
 return actualSchema;
   }
 
+  private static Schema getActualSchemaFromUnion(Schema schema) {
+Schema actualSchema;
+if (schema.getType() != UNION) {
+  return schema;
+}
+if (schema.getTypes().size() == 2
+&& schema.getTypes().get(0).getType() == Schema.Type.NULL) {
+  actualSchema = schema.getTypes().get(1);
+} else if (schema.getTypes().size() == 2
+&& schema.getTypes().get(1).getType() == Schema.Type.NULL) {
+  actualSchema = schema.getTypes().get(0);
+} else if (schema.getTypes().size() == 1) {
+  actualSchema = schema.getTypes().get(0);

Review Comment:
   +1



##
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieAvroDataBlock.java:
##
@@ -206,6 +213,9 @@ public IndexedRecord next() {
 IndexedRecord record = this.reader.read(null, decoder);
 this.dis.skipBytes(recordLength);
 this.readRecords++;
+if (this.promotedSchema.isPresent()) {
+  return HoodieAvroUtils.rewriteRecordWithNewSchema(record, this.promotedSchema.get());

Review Comment:
   Is it possible to read the record with the `promotedSchema` directly, so that there is no need to rewrite each record?
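
   For background, Avro's datum reader can already resolve records into a different reader schema at read time; a generic sketch of that mechanism (standard Avro API, with `writerSchema`, `promotedSchema`, and `decoder` assumed to be in scope in the surrounding reader):

```java
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

// Hypothetical use of Avro schema resolution; writerSchema, promotedSchema,
// and decoder are assumed to exist in the surrounding reader.
GenericDatumReader<GenericRecord> resolvingReader =
    new GenericDatumReader<>(writerSchema, promotedSchema);
GenericRecord resolved = resolvingReader.read(null, decoder);
// Records come out already in promotedSchema, with no per-record rewrite step.
```

   The likely catch, per the javadoc quoted later in this thread, is that vanilla Avro resolution does not cover number-to-string promotion, which may be exactly why the rewrite step exists.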






Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]

2023-10-12 Thread via GitHub


lokesh-lingarajan-0310 commented on code in PR #9743:
URL: https://github.com/apache/hudi/pull/9743#discussion_r1357700698


##
hudi-common/src/main/java/org/apache/hudi/avro/AvroSchemaUtils.java:
##
@@ -116,9 +116,26 @@ public static String getAvroRecordQualifiedName(String tableName) {
 return "hoodie." + sanitizedTableName + "." + sanitizedTableName + "_record";
   }
 
+
+  /**
+   * Validate whether the {@code targetSchema} is a valid evolution of {@code sourceSchema}.
+   * Basically {@link #isCompatibleProjectionOf(Schema, Schema)} but type promotion in the
+   * opposite direction
+   */
+  public static boolean isValidEvolutionOf(Schema sourceSchema, Schema targetSchema) {

Review Comment:
   @jonvex, can we do two kinds of evolution in a single batch, e.g., a column drop plus a nested type evolution? I think we need to test these scenarios so we can set the correct expectations in the product.






Re: [PR] [HUDI-2141] Support flink stream write metrics [hudi]

2023-10-12 Thread via GitHub


hudi-bot commented on PR #9118:
URL: https://github.com/apache/hudi/pull/9118#issuecomment-1760681513

   
   ## CI report:
   
   * f6d7dd97c73898206da91b17144326a7dbbffae8 UNKNOWN
   * c62db1fdf94ee2c1f9b9e539f7a4b1bb866beb7e UNKNOWN
   * 4b4d15361b3096e27cc3c54c3347c2cf9224c895 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20301)
 
   * c930542593c9ff008a3d5cee08be617fc7fbfcb0 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20319)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]

2023-10-12 Thread via GitHub


lokesh-lingarajan-0310 commented on code in PR #9743:
URL: https://github.com/apache/hudi/pull/9743#discussion_r1357692485


##
hudi-common/src/main/java/org/apache/hudi/internal/schema/convert/AvroInternalSchemaConverter.java:
##
@@ -68,6 +68,17 @@ public static Schema convert(InternalSchema internalSchema, String name) {
 return buildAvroSchemaFromInternalSchema(internalSchema, name);
   }
 
+  /**
+   * Fix the null ordering of the given avro Schema.
+   *
+   * @param schema the avro schema.
+   * @return the avro Schema with null ordering fixed.
+   */
+  public static Schema fixNullOrdering(Schema schema) {

Review Comment:
   Yes, an example in the comment explaining why we need this conversion would be helpful for other readers.







Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]

2023-10-12 Thread via GitHub


hudi-bot commented on PR #9819:
URL: https://github.com/apache/hudi/pull/9819#issuecomment-1760676784

   
   ## CI report:
   
   * a4985db0ce22fb4b4f2518ed70bd96890024a08b UNKNOWN
   * e3b8ef2e7f6c0b4155b7a4a5596aa2df558e81a9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20314)
 
   * fcc9de9a960da6510ad8b14aae79deab0331c963 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]

2023-10-12 Thread via GitHub


lokesh-lingarajan-0310 commented on code in PR #9743:
URL: https://github.com/apache/hudi/pull/9743#discussion_r1357689724


##
hudi-common/src/main/java/org/apache/hudi/avro/AvroSchemaUtils.java:
##
@@ -116,9 +116,24 @@ public static String getAvroRecordQualifiedName(String tableName) {
 return "hoodie." + sanitizedTableName + "." + sanitizedTableName + "_record";
   }
 
+  /**
+   * Validate whether the {@code targetSchema} is a valid evolution of {@code sourceSchema}.
+   * Basically {@link #isCompatibleProjectionOf(Schema, Schema)} but type promotion in the
+   * opposite direction
+   */
+  public static boolean isValidEvolutionOf(Schema sourceSchema, Schema targetSchema) {
+    return isProjectionOfInternal(sourceSchema, targetSchema,
+        AvroSchemaUtils::isAtomicSchemasCompatibleEvolution);
+  }
+
+  private static boolean isAtomicSchemasCompatibleEvolution(Schema oneAtomicType, Schema anotherAtomicType) {
+    // NOTE: Checking for compatibility of atomic types, we should ignore their
+    //       corresponding fully-qualified names (as irrelevant)
+    return isSchemaCompatible(anotherAtomicType, oneAtomicType, false, true);

Review Comment:
   @jonvex - "canProject" api call inside this function only check first level 
fields. If there is an invalid evolution in the nested fields will this pass ?
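
   To make the direction concrete, here is a hedged illustration of what "type promotion in the opposite direction" means for `isValidEvolutionOf` (ad-hoc schemas; the expected results are my reading of the javadoc, not assertions from the PR's tests):

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.hudi.avro.AvroSchemaUtils;

public class EvolutionDirectionDemo {
  public static void main(String[] args) {
    Schema intField = SchemaBuilder.record("r").fields().requiredInt("a").endRecord();
    Schema longField = SchemaBuilder.record("r").fields().requiredLong("a").endRecord();
    // Widening int -> long should be a valid evolution of the source schema...
    System.out.println(AvroSchemaUtils.isValidEvolutionOf(intField, longField));  // expected: true
    // ...while narrowing long -> int should not be.
    System.out.println(AvroSchemaUtils.isValidEvolutionOf(longField, intField));  // expected: false
  }
}
```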






Re: [PR] [HUDI-2141] Support flink stream write metrics [hudi]

2023-10-12 Thread via GitHub


hudi-bot commented on PR #9118:
URL: https://github.com/apache/hudi/pull/9118#issuecomment-1760676179

   
   ## CI report:
   
   * f6d7dd97c73898206da91b17144326a7dbbffae8 UNKNOWN
   * c62db1fdf94ee2c1f9b9e539f7a4b1bb866beb7e UNKNOWN
   * 4b4d15361b3096e27cc3c54c3347c2cf9224c895 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20301)
 
   * c930542593c9ff008a3d5cee08be617fc7fbfcb0 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6480] Flink support non-blocking concurrency control [hudi]

2023-10-12 Thread via GitHub


beyond1920 commented on PR #9850:
URL: https://github.com/apache/hudi/pull/9850#issuecomment-1760672959

   @danny0405 Thanks for the review. I will add more tests soon.





Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]

2023-10-12 Thread via GitHub


hudi-bot commented on PR #9819:
URL: https://github.com/apache/hudi/pull/9819#issuecomment-1760671544

   
   ## CI report:
   
   * a4985db0ce22fb4b4f2518ed70bd96890024a08b UNKNOWN
   * e3b8ef2e7f6c0b4155b7a4a5596aa2df558e81a9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20314)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [I] [SUPPORT] persist write status RDD in spark compaction job caused the resources could not be released in time [hudi]

2023-10-12 Thread via GitHub


beyond1920 commented on issue #9591:
URL: https://github.com/apache/hudi/issues/9591#issuecomment-1760657182

   @KnightChess Not yet. Waiting for confirmation.





Re: [PR] [HUDI-6669] HoodieEngineContext should not use parallel stream with parallelism greater than CPU cores [hudi]

2023-10-12 Thread via GitHub


SteNicholas closed pull request #9395: [HUDI-6669] HoodieEngineContext should 
not use parallel stream with parallelism greater than CPU cores
URL: https://github.com/apache/hudi/pull/9395





Re: [PR] [HUDI-2141] Support flink stream write metrics [hudi]

2023-10-12 Thread via GitHub


stream2000 commented on code in PR #9118:
URL: https://github.com/apache/hudi/pull/9118#discussion_r1357668260


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/metrics/FlinkStreamWriteMetrics.java:
##
@@ -0,0 +1,170 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.metrics;
+
+import org.apache.hudi.sink.common.AbstractStreamWriteFunction;
+
+import com.codahale.metrics.SlidingWindowReservoir;
+import org.apache.flink.dropwizard.metrics.DropwizardHistogramWrapper;
+import org.apache.flink.dropwizard.metrics.DropwizardMeterWrapper;
+import org.apache.flink.metrics.Histogram;
+import org.apache.flink.metrics.Meter;
+import org.apache.flink.metrics.MetricGroup;
+
+/**
+ * Metrics for flink stream write (including append write, normal/bucket 
stream write etc.).
+ * Used in subclasses of {@link AbstractStreamWriteFunction}.
+ */
+public class FlinkStreamWriteMetrics extends HoodieFlinkMetrics {
+  private static final String DATA_FLUSH_KEY = "data_flush";
+  private static final String FILE_FLUSH_KEY = "file_flush";
+  private static final String HANDLE_CREATION_KEY = "handle_creation";
+
+  /**
+   * Flush data costs during checkpoint.
+   */
+  private long dataFlushCosts;
+
+  /**
+   * Number of records written in during a checkpoint window.
+   */
+  protected long writtenRecords;
+
+  /**
+   * Current write buffer size in StreamWriteFunction.
+   */
+  private long writeBufferedSize;
+
+  /**
+   * Total costs for closing write handles during a checkpoint window.
+   */
+  private long fileFlushTotalCosts;
+
+  /**
+   * Number of handles opened during a checkpoint window. Increased with 
partition number/bucket number etc.
+   */
+  private long numOfOpenHandle;
+
+  /**
+   * Number of files written during a checkpoint window.
+   */
+  private long numOfFilesWritten;
+
+  /**
+   * Number of records written per seconds.
+   */
+  protected final Meter recordWrittenPerSecond;
+
+  /**
+   * Number of write handle switches per seconds.
+   */
+  private final Meter handleSwitchPerSecond;
+
+  /**
+   * Cost of write handle creation.
+   */
+  private final Histogram handleCreationCosts;
+
+  /**
+   * Cost of a file flush.
+   */
+  private final Histogram fileFlushCost;
+
+  public FlinkStreamWriteMetrics(MetricGroup metricGroup) {
+    super(metricGroup);
+    this.recordWrittenPerSecond = new DropwizardMeterWrapper(new com.codahale.metrics.Meter());
+    this.handleSwitchPerSecond = new DropwizardMeterWrapper(new com.codahale.metrics.Meter());
+    this.handleCreationCosts = new DropwizardHistogramWrapper(new com.codahale.metrics.Histogram(new SlidingWindowReservoir(100)));
+    this.fileFlushCost = new DropwizardHistogramWrapper(new com.codahale.metrics.Histogram(new SlidingWindowReservoir(100)));
+  }
+
+  @Override
+  public void registerMetrics() {
+metricGroup.meter("recordWrittenPerSecond", recordWrittenPerSecond);
+metricGroup.gauge("currentCommitWrittenRecords", () -> writtenRecords);
+metricGroup.gauge("dataFlushCosts", () -> dataFlushCosts);
+metricGroup.gauge("writeBufferedSize", () -> writeBufferedSize);
+
+metricGroup.gauge("fileFlushTotalCosts", () -> fileFlushTotalCosts);
+metricGroup.gauge("numOfFilesWritten", () -> numOfFilesWritten);
+metricGroup.gauge("numOfOpenHandle", () -> numOfOpenHandle);
+
+metricGroup.meter("handleSwitchPerSecond", handleSwitchPerSecond);
+
+metricGroup.histogram("handleCreationCosts", handleCreationCosts);
+metricGroup.histogram("handleCloseCosts", fileFlushCost);
+  }

Review Comment:
   Sure, also checked the javadoc. Thanks for pointing this out~
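
   As a usage note, these metrics hang off the operator's metric group. A hedged sketch of wiring them up inside a subclass of `AbstractStreamWriteFunction` (the method placement is assumed; the PR's actual integration point may differ):

```java
import org.apache.flink.configuration.Configuration;

// Hypothetical wiring; the PR's actual integration point may differ.
@Override
public void open(Configuration parameters) throws Exception {
  FlinkStreamWriteMetrics writeMetrics =
      new FlinkStreamWriteMetrics(getRuntimeContext().getMetricGroup());
  writeMetrics.registerMetrics();
}
```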






[hudi] branch master updated: Follow up HUDI-6937, fix the RealtimeCompactedRecordReader props instantiation (#9853)

2023-10-12 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 58f02e17db2 Follow up HUDI-6937, fix the RealtimeCompactedRecordReader 
props instantiation (#9853)
58f02e17db2 is described below

commit 58f02e17db22b5aa0a11a3e0ca1357fa069f98b2
Author: Danny Chan 
AuthorDate: Fri Oct 13 09:58:00 2023 +0800

Follow up HUDI-6937, fix the RealtimeCompactedRecordReader props 
instantiation (#9853)
---
 .../org/apache/hudi/hadoop/realtime/RealtimeCompactedRecordReader.java | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/RealtimeCompactedRecordReader.java b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/RealtimeCompactedRecordReader.java
index ba9c2b9a5a7..c5432455cda 100644
--- a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/RealtimeCompactedRecordReader.java
+++ b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/RealtimeCompactedRecordReader.java
@@ -22,7 +22,6 @@ import org.apache.hudi.avro.HoodieAvroUtils;
 import org.apache.hudi.common.config.HoodieCommonConfig;
 import org.apache.hudi.common.config.HoodieMemoryConfig;
 import org.apache.hudi.common.config.HoodieReaderConfig;
-import org.apache.hudi.common.config.TypedProperties;
 import org.apache.hudi.common.fs.FSUtils;
 import org.apache.hudi.common.model.HoodieAvroIndexedRecord;
 import org.apache.hudi.common.model.HoodieAvroRecordMerger;
@@ -198,7 +197,7 @@ public class RealtimeCompactedRecordReader extends AbstractRealtimeRecordReader
 GenericRecord genericRecord = HiveAvroSerializer.rewriteRecordIgnoreResultCheck(oldRecord, getLogScannerReaderSchema());
 HoodieRecord record = new HoodieAvroIndexedRecord(genericRecord);
 Option> mergeResult = HoodieAvroRecordMerger.INSTANCE.merge(record,
-genericRecord.getSchema(), newRecord, getLogScannerReaderSchema(), new TypedProperties(payloadProps));
+genericRecord.getSchema(), newRecord, getLogScannerReaderSchema(), payloadProps);
 return mergeResult.map(p -> (HoodieAvroIndexedRecord) p.getLeft());
   }
 



Re: [PR] Follow up HUDI-6937, fix the RealtimeCompactedRecordReader props inst… [hudi]

2023-10-12 Thread via GitHub


danny0405 merged PR #9853:
URL: https://github.com/apache/hudi/pull/9853





[jira] [Closed] (HUDI-6939) Add async archiving for Flink

2023-10-12 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-6939.

Resolution: Fixed

Fixed via master branch: 64d231c8fecd15835fb63637e1c7db36db6120f8

> Add async archiving for Flink
> -
>
> Key: HUDI-6939
> URL: https://issues.apache.org/jira/browse/HUDI-6939
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: flink
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>






[hudi] branch master updated (af2f74ba50c -> 64d231c8fec)

2023-10-12 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from af2f74ba50c [HUDI-6937] CopyOnWriteInsertHandler#consume cause 
clustering performance degradation (#9851)
 add 64d231c8fec [HUDI-6939] Add async archiving for Flink (#9854)

No new revisions were added by this update.

Summary of changes:
 .../src/main/java/org/apache/hudi/client/HoodieFlinkWriteClient.java | 1 +
 1 file changed, 1 insertion(+)



Re: [PR] [HUDI-6939] Add async archiving for Flink [hudi]

2023-10-12 Thread via GitHub


danny0405 merged PR #9854:
URL: https://github.com/apache/hudi/pull/9854





Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]

2023-10-12 Thread via GitHub


the-other-tim-brown commented on code in PR #9743:
URL: https://github.com/apache/hudi/pull/9743#discussion_r1357635961


##
hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java:
##
@@ -1131,6 +1196,25 @@ private static Schema getActualSchemaFromUnion(Schema 
schema, Object data) {
 return actualSchema;
   }
 
+  private static Schema getActualSchemaFromUnion(Schema schema) {
+Schema actualSchema;
+if (schema.getType() != UNION) {
+  return schema;
+}
+if (schema.getTypes().size() == 2
+&& schema.getTypes().get(0).getType() == Schema.Type.NULL) {
+  actualSchema = schema.getTypes().get(1);
+} else if (schema.getTypes().size() == 2
+&& schema.getTypes().get(1).getType() == Schema.Type.NULL) {
+  actualSchema = schema.getTypes().get(0);
+} else if (schema.getTypes().size() == 1) {
+  actualSchema = schema.getTypes().get(0);

Review Comment:
   Is there any way to share this logic with the method above?
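
   One hypothetical shape for the sharing, assuming the schema-only variant returns the union unchanged when it cannot resolve it (the data-aware overload's real body is not shown in this diff, so this is one possible sketch, not the PR's code):

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;

// Hypothetical refactor sketch, not the PR's code.
private static Schema getActualSchemaFromUnion(Schema schema, Object data) {
  Schema resolved = getActualSchemaFromUnion(schema);
  if (resolved.getType() != Schema.Type.UNION) {
    return resolved;
  }
  // Wider unions still need the runtime datum to pick a branch;
  // Avro's GenericData.resolveUnion does exactly that.
  return schema.getTypes().get(GenericData.get().resolveUnion(schema, data));
}
```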



##
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieCommonConfig.java:
##
@@ -71,6 +71,14 @@ public class HoodieCommonConfig extends HoodieConfig {
   + " operation will fail schema compatibility check. Set this option 
to true will make the newly added "
   + " column nullable to successfully complete the write operation.");
 
+  public static final ConfigProperty ADD_NULL_FOR_DELETED_COLUMNS = 
ConfigProperty

Review Comment:
   What would the behavior be when this is false and schema evolution is 
enabled? Is there an option where it would auto-drop the column in the target 
table?



##
hudi-common/src/main/java/org/apache/hudi/internal/schema/utils/AvroSchemaEvolutionUtils.java:
##
@@ -111,17 +111,21 @@ public static InternalSchema reconcileSchema(Schema incomingSchema, InternalSche
 return SchemaChangeUtils.applyTableChanges2Schema(internalSchemaAfterAddColumns, typeChange);
   }
 
+  public static Schema reconcileSchema(Schema incomingSchema, Schema oldTableSchema) {
+    return convert(reconcileSchema(incomingSchema, convert(oldTableSchema)), oldTableSchema.getFullName());
+  }
+
   /**
-   * Reconciles nullability requirements b/w {@code source} and {@code target} schemas,
+   * Reconciles nullability and datatype requirements b/w {@code source} and {@code target} schemas,
    * by adjusting these of the {@code source} schema to be in-line with the ones of the
    * {@code target} one
    *
    * @param sourceSchema source schema that needs reconciliation
    * @param targetSchema target schema that source schema will be reconciled against
    * @param opts config options
-   * @return schema (based off {@code source} one) that has nullability constraints reconciled
+   * @return schema (based off {@code source} one) that has nullability constraints and datatypes reconciled
    */
-  public static Schema reconcileNullability(Schema sourceSchema, Schema targetSchema, Map opts) {
+  public static Schema reconcileSchemaRequirements(Schema sourceSchema, Schema targetSchema, Map opts) {

Review Comment:
   Do we have unit testing on this?
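
   A hedged sketch of what such a unit test could exercise (the schemas and the expected int-to-long promotion are assumptions based on the javadoc, not taken from the PR's test suite):

```java
import java.util.Collections;

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.hudi.internal.schema.utils.AvroSchemaEvolutionUtils;

public class ReconcileRequirementsDemo {
  public static void main(String[] args) {
    Schema source = SchemaBuilder.record("rec").fields().requiredInt("a").endRecord();
    Schema target = SchemaBuilder.record("rec").fields().requiredLong("a").endRecord();
    // Assumed expectation: the source's int field is promoted to long to
    // satisfy the target's datatype requirement.
    Schema reconciled = AvroSchemaEvolutionUtils.reconcileSchemaRequirements(
        source, target, Collections.emptyMap());
    System.out.println(reconciled.toString(true));
  }
}
```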



##
hudi-common/src/main/java/org/apache/hudi/internal/schema/convert/AvroInternalSchemaConverter.java:
##
@@ -68,6 +68,17 @@ public static Schema convert(InternalSchema internalSchema, String name) {
 return buildAvroSchemaFromInternalSchema(internalSchema, name);
   }
 
+  /**
+   * Fix the null ordering of the given avro Schema.
+   *
+   * @param schema the avro schema.
+   * @return the avro Schema with null ordering fixed.
+   */
+  public static Schema fixNullOrdering(Schema schema) {

Review Comment:
   Why do we need to do this conversion?



##
hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java:
##
@@ -661,6 +652,35 @@ private Pair>> fetchFromSourc
 return Pair.of(schemaProvider, Pair.of(checkpointStr, records));
   }
 
+  /**
+   * Apply schema reconciliation and schema evolution rules (schema-on-read) and generate a new target schema provider.
+   *
+   * @param incomingSchema schema of the source data
+   * @param sourceSchemaProvider the source schema provider.
+   * @return the SchemaProvider that can be used as the writer schema.
+   */
+  private SchemaProvider getDeducedSchemaProvider(Schema incomingSchema, SchemaProvider sourceSchemaProvider) {

Review Comment:
   Let's try to step through a case that can hit that path today.



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSchemaUtils.scala:
##
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in 

Re: [PR] [HUDI-6939] Add async archiving for Flink [hudi]

2023-10-12 Thread via GitHub


danny0405 commented on code in PR #9854:
URL: https://github.com/apache/hudi/pull/9854#discussion_r1357641540


##
hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/client/HoodieFlinkWriteClient.java:
##
@@ -309,6 +309,7 @@ public void preTxn(HoodieTableMetaClient metaClient) {
   this.lastCompletedTxnAndMetadata = TransactionUtils.getLastCompletedTxnInstantAndMetadata(metaClient);
   this.pendingInflightAndRequestedInstants = TransactionUtils.getInflightAndRequestedInstants(metaClient);
 }
+tableServiceClient.startAsyncArchiveService(this);

Review Comment:
   We have `TestAsyncArchiveService` for unit tests, but we do not have engine-specific tests for async archiving.






Re: [PR] Follow up HUDI-6937, fix the RealtimeCompactedRecordReader props inst… [hudi]

2023-10-12 Thread via GitHub


danny0405 commented on code in PR #9853:
URL: https://github.com/apache/hudi/pull/9853#discussion_r1357639219


##
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/RealtimeCompactedRecordReader.java:
##
@@ -198,7 +197,7 @@ private Option mergeRecord(HoodieRecord newRecord, A
 GenericRecord genericRecord = HiveAvroSerializer.rewriteRecordIgnoreResultCheck(oldRecord, getLogScannerReaderSchema());
 HoodieRecord record = new HoodieAvroIndexedRecord(genericRecord);
 Option> mergeResult = HoodieAvroRecordMerger.INSTANCE.merge(record,
-genericRecord.getSchema(), newRecord, getLogScannerReaderSchema(), new TypedProperties(payloadProps));
+genericRecord.getSchema(), newRecord, getLogScannerReaderSchema(), payloadProps);

Review Comment:
   `TypedProperties` itself is serializable, so the wrapping should be a mistake. In HUDI-6937 we found a performance regression of almost 100% for parquet files written from the insert handle.
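
   To illustrate the cost pattern being fixed: the merge runs once per record, so a defensive copy of the properties on every call adds an allocation plus a full copy to a hot loop. A generic, self-contained demo with plain `java.util.Properties` standing in for `TypedProperties` (numbers will vary):

```java
import java.util.Properties;

public class PropsCopyDemo {
  public static void main(String[] args) {
    Properties payloadProps = new Properties();
    payloadProps.setProperty("some.payload.config", "ts");  // placeholder entry

    long start = System.nanoTime();
    for (int i = 0; i < 1_000_000; i++) {
      // Mirrors the old code path: a fresh copy per merged record.
      Properties perRecord = new Properties();
      perRecord.putAll(payloadProps);
    }
    long elapsedMs = (System.nanoTime() - start) / 1_000_000;
    System.out.println("copy-per-record loop: " + elapsedMs + " ms");
  }
}
```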






Re: [PR] Follow up HUDI-6937, fix the RealtimeCompactedRecordReader props inst… [hudi]

2023-10-12 Thread via GitHub


codope commented on code in PR #9853:
URL: https://github.com/apache/hudi/pull/9853#discussion_r1357636337


##
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/RealtimeCompactedRecordReader.java:
##
@@ -198,7 +197,7 @@ private Option mergeRecord(HoodieRecord newRecord, A
 GenericRecord genericRecord = HiveAvroSerializer.rewriteRecordIgnoreResultCheck(oldRecord, getLogScannerReaderSchema());
 HoodieRecord record = new HoodieAvroIndexedRecord(genericRecord);
 Option> mergeResult = HoodieAvroRecordMerger.INSTANCE.merge(record,
-genericRecord.getSchema(), newRecord, getLogScannerReaderSchema(), new TypedProperties(payloadProps));
+genericRecord.getSchema(), newRecord, getLogScannerReaderSchema(), payloadProps);

Review Comment:
   Any reason why we were wrapping in TypedProperties before? Does it need to be serializable?






Re: [PR] [HUDI-6939] Add async archiving for Flink [hudi]

2023-10-12 Thread via GitHub


hudi-bot commented on PR #9854:
URL: https://github.com/apache/hudi/pull/9854#issuecomment-1760628292

   
   ## CI report:
   
   * 72c64df16ef6b0becb610c70067739b375f4d0bf Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20316)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6939] Add async archiving for Flink [hudi]

2023-10-12 Thread via GitHub


codope commented on code in PR #9854:
URL: https://github.com/apache/hudi/pull/9854#discussion_r1357635870


##
hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/client/HoodieFlinkWriteClient.java:
##
@@ -309,6 +309,7 @@ public void preTxn(HoodieTableMetaClient metaClient) {
   this.lastCompletedTxnAndMetadata = TransactionUtils.getLastCompletedTxnInstantAndMetadata(metaClient);
   this.pendingInflightAndRequestedInstants = TransactionUtils.getInflightAndRequestedInstants(metaClient);
 }
+tableServiceClient.startAsyncArchiveService(this);

Review Comment:
   Don't we have any test for this?






Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]

2023-10-12 Thread via GitHub


the-other-tim-brown commented on code in PR #9743:
URL: https://github.com/apache/hudi/pull/9743#discussion_r1357634303


##
hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java:
##
@@ -1081,6 +1083,69 @@ private static Object rewritePrimaryTypeWithDiffSchemaType(Object oldValue, Sche
 throw new AvroRuntimeException(String.format("cannot support rewrite value for schema type: %s since the old schema type is: %s", newSchema, oldSchema));
   }
 
+  /**
+   * Avro does not support type promotion from numbers to string. This function returns true if
+   * it will be necessary to rewrite the record to support this promotion.
+   * NOTE: this does not determine whether the writerSchema and readerSchema are compatible.
+   * It is just trying to find if the reader expects a number to be promoted to string, as quick as possible.
+   */
+  public static boolean recordNeedsRewriteForExtendedAvroTypePromotion(Schema writerSchema, Schema readerSchema) {
+    if (writerSchema.equals(readerSchema)) {
+      return false;
+    }
+    switch (readerSchema.getType()) {
+      case RECORD:
+        Map writerFields = new HashMap<>();
+        for (Schema.Field field : writerSchema.getFields()) {
+          writerFields.put(field.name(), field);
+        }
+        for (Schema.Field field : readerSchema.getFields()) {
+          if (writerFields.containsKey(field.name())) {
+            if (recordNeedsRewriteForExtendedAvroTypePromotion(writerFields.get(field.name()).schema(), field.schema())) {
+              return true;
+            }

Review Comment:
   simply return the result of 
`recordNeedsRewriteForExtendedAvroTypePromotion(writerFields.get(field.name()).schema(),
 field.schema())` here?







Re: [PR] [HUDI-6480] Flink support non-blocking concurrency control [hudi]

2023-10-12 Thread via GitHub


danny0405 commented on code in PR #9850:
URL: https://github.com/apache/hudi/pull/9850#discussion_r1357634569


##
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/TestWriteCopyOnWrite.java:
##
@@ -540,9 +541,7 @@ public void testWriteMultiWriterInvolved() throws Exception {
 .checkWrittenData(EXPECTED3, 1);
 // step to commit the 2nd txn, should throw exception
 // for concurrent modification of same fileGroups
-pipeline1.checkpoint(1)
-.assertNextEvent()
-.checkpointCompleteThrows(1, HoodieWriteConflictException.class, "Cannot resolve conflicts");
+testConcurrentCommit(pipeline1);

Review Comment:
   Can we add new tests instead of amending the existing one?






Re: [PR] [HUDI-6480] Flink support non-blocking concurrency control [hudi]

2023-10-12 Thread via GitHub


danny0405 commented on code in PR #9850:
URL: https://github.com/apache/hudi/pull/9850#discussion_r1357634270


##
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/TestWriteMergeOnRead.java:
##
@@ -213,6 +213,14 @@ protected Map getMiniBatchExpected() {
 return expected;
   }
 
+  @Override
+  protected void testConcurrentCommit(TestHarness pipeline) throws Exception {
+pipeline.checkpoint(1)
+.assertNextEvent()

Review Comment:
   I'm confused by the method name: is it a concurrent commit? The pipeline only makes one commit.






Re: [PR] [HUDI-6480] Flink support non-blocking concurrency control [hudi]

2023-10-12 Thread via GitHub


danny0405 commented on code in PR #9850:
URL: https://github.com/apache/hudi/pull/9850#discussion_r1357633319


##
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/utils/TestWriteBase.java:
##
@@ -524,9 +528,7 @@ private void checkInstantState(HoodieInstant.State state, String instantStr) {
 }
 
 protected String lastCompleteInstant() {
-  return OptionsResolver.isMorTable(conf)
-  ? TestUtils.getLastDeltaCompleteInstant(basePath)
-  : TestUtils.getLastCompleteInstant(basePath, HoodieTimeline.COMMIT_ACTION);
+  return this.ckpMetadata.lastCompleteInstant();

Review Comment:
   Why must we fetch the instant from the ckp metadata?






Re: [PR] [HUDI-6939] Add async archiving for Flink [hudi]

2023-10-12 Thread via GitHub


hudi-bot commented on PR #9854:
URL: https://github.com/apache/hudi/pull/9854#issuecomment-1760596242

   
   ## CI report:
   
   * 72c64df16ef6b0becb610c70067739b375f4d0bf UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]

2023-10-12 Thread via GitHub


hudi-bot commented on PR #9819:
URL: https://github.com/apache/hudi/pull/9819#issuecomment-1760596145

   
   ## CI report:
   
   * a4985db0ce22fb4b4f2518ed70bd96890024a08b UNKNOWN
   * df93e1373c94d9988ac002dd2155eba0008031d3 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20312)
 
   * e3b8ef2e7f6c0b4155b7a4a5596aa2df558e81a9 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20314)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Follow up HUDI-6937, fix the RealtimeCompactedRecordReader props inst… [hudi]

2023-10-12 Thread via GitHub


hudi-bot commented on PR #9853:
URL: https://github.com/apache/hudi/pull/9853#issuecomment-1760596194

   
   ## CI report:
   
   * 7882181bc36094e7557e15237991a337d9518b79 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20315)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]

2023-10-12 Thread via GitHub


danny0405 commented on code in PR #9819:
URL: https://github.com/apache/hudi/pull/9819#discussion_r1357608272


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieFileGroupRecordBuffer.java:
##
@@ -0,0 +1,117 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.common.table.read;
+
+import org.apache.hudi.common.model.DeleteRecord;
+import org.apache.hudi.common.table.log.KeySpec;
+import org.apache.hudi.common.table.log.block.HoodieDataBlock;
+import org.apache.hudi.common.table.log.block.HoodieDeleteBlock;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.ClosableIterator;
+import org.apache.hudi.common.util.collection.Pair;
+
+import java.io.IOException;
+import java.util.Iterator;
+import java.util.Map;
+
+public interface HoodieFileGroupRecordBuffer {
+  enum BufferType {
+KEY_BASED,
+POSITION_BASED
+  }

Review Comment:
   Maybe we can have a better name: xxxBlockProcessor?
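
   For concreteness, a minimal sketch of one possible reading of that 
suggestion; the name and shape are only a proposal, not the PR's actual API:

```java
// Hypothetical rename: the variants describe how log blocks are processed
// (by key vs. by record position), so a processor-style name may read better.
public interface HoodieFileGroupBlockProcessor {
  enum ProcessorType {
    KEY_BASED,
    POSITION_BASED
  }
}
```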



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]

2023-10-12 Thread via GitHub


danny0405 commented on code in PR #9819:
URL: https://github.com/apache/hudi/pull/9819#discussion_r1357606976


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieFileGroupReader.java:
##
@@ -128,6 +137,33 @@ public HoodieFileGroupReader(HoodieReaderContext 
readerContext,
 this.readerState.baseFileAvroSchema = avroSchema;
 this.readerState.logRecordAvroSchema = avroSchema;
 this.readerState.mergeProps.putAll(props);
+
+FileSystem fs = readerContext.getFs(logFilePathList.get().get(0), 
hadoopConf);
+HoodieTableMetaClient metaClient = HoodieTableMetaClient
+.builder().setConf(fs.getConf()).setBasePath(tablePath).build();
+
+HoodieFileGroupRecordBuffer keyBasedRecordBuffer = new 
HoodieKeyBasedFileGroupRecordBuffer(
+readerContext,
+readerState.logRecordAvroSchema,
+readerState.baseFileAvroSchema,
+Option.of(getRelativePartitionPath(
+new Path(readerState.tablePath), new 
Path(logFilePathList.get().get(0)).getParent())),
+Option.of(metaClient.getTableConfig().getPartitionFields()),
+recordMerger,
+props,
+metaClient);
+HoodieFileGroupRecordBuffer positionBasedRecordBuffer = new 
HoodiePositionBasedFileGroupRecordBuffer(
+readerContext,

Review Comment:
   Can we move the record buffer instance creation to its factory class? There 
is no need to create two instances up front.
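
   A minimal sketch of the lazy-construction idea, assuming a hypothetical 
`RecordBufferFactory`; the `Supplier` arguments stand in for the long 
constructor parameter lists:

```java
import java.util.function.Supplier;

// Hypothetical factory: only the requested buffer variant is constructed,
// so the unused variant's constructor never runs.
final class RecordBufferFactory {

  private RecordBufferFactory() {
  }

  static <T> T create(boolean useRecordPositions,
                      Supplier<T> keyBased,
                      Supplier<T> positionBased) {
    // Exactly one supplier is invoked.
    return useRecordPositions ? positionBased.get() : keyBased.get();
  }
}
```

   The caller would then pass suppliers such as 
`() -> new HoodieKeyBasedFileGroupRecordBuffer(...)` and keep only the 
instance it actually needs.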



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Follow up HUDI-6937, fix the RealtimeCompactedRecordReader props inst… [hudi]

2023-10-12 Thread via GitHub


hudi-bot commented on PR #9853:
URL: https://github.com/apache/hudi/pull/9853#issuecomment-1760587858

   
   ## CI report:
   
   * 7882181bc36094e7557e15237991a337d9518b79 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]

2023-10-12 Thread via GitHub


hudi-bot commented on PR #9819:
URL: https://github.com/apache/hudi/pull/9819#issuecomment-1760587735

   
   ## CI report:
   
   * a4985db0ce22fb4b4f2518ed70bd96890024a08b UNKNOWN
   * df93e1373c94d9988ac002dd2155eba0008031d3 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20312)
 
   * e3b8ef2e7f6c0b4155b7a4a5596aa2df558e81a9 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6939) Add async archiving for Flink

2023-10-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-6939:
-
Labels: pull-request-available  (was: )

> Add async archiving for Flink
> -
>
> Key: HUDI-6939
> URL: https://issues.apache.org/jira/browse/HUDI-6939
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: flink
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[PR] [HUDI-6939] Add async archiving for Flink [hudi]

2023-10-12 Thread via GitHub


danny0405 opened a new pull request, #9854:
URL: https://github.com/apache/hudi/pull/9854

   ### Change Logs
   
   Async archiving for Flink.
   
   ### Impact
   
   none
   
   ### Risk level (write none, low medium or high below)
   
   none
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-6939) Add async archiving for Flink

2023-10-12 Thread Danny Chen (Jira)
Danny Chen created HUDI-6939:


 Summary: Add async archiving for Flink
 Key: HUDI-6939
 URL: https://issues.apache.org/jira/browse/HUDI-6939
 Project: Apache Hudi
  Issue Type: Improvement
  Components: flink
Reporter: Danny Chen
 Fix For: 1.0.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-2141] Support flink stream write metrics [hudi]

2023-10-12 Thread via GitHub


danny0405 commented on code in PR #9118:
URL: https://github.com/apache/hudi/pull/9118#discussion_r1357583122


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/metrics/FlinkStreamWriteMetrics.java:
##
@@ -0,0 +1,170 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.metrics;
+
+import org.apache.hudi.sink.common.AbstractStreamWriteFunction;
+
+import com.codahale.metrics.SlidingWindowReservoir;
+import org.apache.flink.dropwizard.metrics.DropwizardHistogramWrapper;
+import org.apache.flink.dropwizard.metrics.DropwizardMeterWrapper;
+import org.apache.flink.metrics.Histogram;
+import org.apache.flink.metrics.Meter;
+import org.apache.flink.metrics.MetricGroup;
+
+/**
+ * Metrics for flink stream write (including append write, normal/bucket 
stream write etc.).
+ * Used in subclasses of {@link AbstractStreamWriteFunction}.
+ */
+public class FlinkStreamWriteMetrics extends HoodieFlinkMetrics {
+  private static final String DATA_FLUSH_KEY = "data_flush";
+  private static final String FILE_FLUSH_KEY = "file_flush";
+  private static final String HANDLE_CREATION_KEY = "handle_creation";
+
+  /**
+   * Flush data costs during checkpoint.
+   */
+  private long dataFlushCosts;
+
+  /**
+   * Number of records written in during a checkpoint window.
+   */
+  protected long writtenRecords;
+
+  /**
+   * Current write buffer size in StreamWriteFunction.
+   */
+  private long writeBufferedSize;
+
+  /**
+   * Total costs for closing write handles during a checkpoint window.
+   */
+  private long fileFlushTotalCosts;
+
+  /**
+   * Number of handles opened during a checkpoint window. Increased with 
partition number/bucket number etc.
+   */
+  private long numOfOpenHandle;
+
+  /**
+   * Number of files written during a checkpoint window.
+   */
+  private long numOfFilesWritten;
+
+  /**
+   * Number of records written per seconds.
+   */
+  protected final Meter recordWrittenPerSecond;
+
+  /**
+   * Number of write handle switches per seconds.
+   */
+  private final Meter handleSwitchPerSecond;
+
+  /**
+   * Cost of write handle creation.
+   */
+  private final Histogram handleCreationCosts;
+
+  /**
+   * Cost of a file flush.
+   */
+  private final Histogram fileFlushCost;
+
+  public FlinkStreamWriteMetrics(MetricGroup metricGroup) {
+super(metricGroup);
+this.recordWrittenPerSecond = new DropwizardMeterWrapper(new 
com.codahale.metrics.Meter());
+this.handleSwitchPerSecond = new DropwizardMeterWrapper(new 
com.codahale.metrics.Meter());
+this.handleCreationCosts = new DropwizardHistogramWrapper(new 
com.codahale.metrics.Histogram(new SlidingWindowReservoir(100)));
+this.fileFlushCost = new DropwizardHistogramWrapper(new 
com.codahale.metrics.Histogram(new SlidingWindowReservoir(100)));
+  }
+
+  @Override
+  public void registerMetrics() {
+metricGroup.meter("recordWrittenPerSecond", recordWrittenPerSecond);
+metricGroup.gauge("currentCommitWrittenRecords", () -> writtenRecords);
+metricGroup.gauge("dataFlushCosts", () -> dataFlushCosts);
+metricGroup.gauge("writeBufferedSize", () -> writeBufferedSize);
+
+metricGroup.gauge("fileFlushTotalCosts", () -> fileFlushTotalCosts);
+metricGroup.gauge("numOfFilesWritten", () -> numOfFilesWritten);
+metricGroup.gauge("numOfOpenHandle", () -> numOfOpenHandle);
+
+metricGroup.meter("handleSwitchPerSecond", handleSwitchPerSecond);
+
+metricGroup.histogram("handleCreationCosts", handleCreationCosts);
+metricGroup.histogram("handleCloseCosts", fileFlushCost);
+  }

Review Comment:
   handleCloseCosts -> fileFlushCosts?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] Follow up HUDI-6937, fix the RealtimeCompactedRecordReader props inst… [hudi]

2023-10-12 Thread via GitHub


danny0405 opened a new pull request, #9853:
URL: https://github.com/apache/hudi/pull/9853

   …antiation
   
   ### Change Logs
   
   Fix the RealtimeCompactedRecordReader to not instantiate the props 
per-record.
   
   ### Impact
   
   none
   
   ### Risk level (write none, low medium or high below)
   
   none
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Closed] (HUDI-6937) CopyOnWriteInsertHandler#consume will cause clustering performance degradation

2023-10-12 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-6937.

Resolution: Fixed

Fixed via master branch: af2f74ba50c250b1d830f800d960c01bf66f646a

> CopyOnWriteInsertHandler#consume will cause clustering performance degradation
> --
>
> Key: HUDI-6937
> URL: https://issues.apache.org/jira/browse/HUDI-6937
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark
>Reporter: kwang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0, 0.14.1
>
> Attachments: hudi-0.12-flamegraph.png, hudi-0.12-log.png, 
> hudi-0.14-flamegraph.png, hudi-0.14-log.png
>
>
> We upgraded Hudi from 0.12 to 0.14 and found that the offline clustering 
> performance dropped by half. We compared and analyzed the flame graphs of 
> these two versions and found that TypedProperties object instantiation takes 
> too much time, which causes the clustering performance degradation.
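
A minimal stand-in illustrating the allocation pattern described above; all 
names here are illustrative, and the actual fix can be seen in the commit diff 
further below:

```java
import java.util.Properties;

public class PerRecordAllocationDemo {

  // Stand-in for a write handle; the real code passes properties to
  // HoodieWriteHandle#write for every record.
  static void write(Object record, Properties props) {
  }

  public static void main(String[] args) {
    Properties base = new Properties();
    Object[] records = new Object[1_000_000];

    // Anti-pattern: a fresh Properties copy is allocated per record,
    // which is what dominated the flame graph during clustering.
    for (Object rec : records) {
      write(rec, new Properties(base));
    }

    // Fix: build the properties once and reuse the same instance.
    for (Object rec : records) {
      write(rec, base);
    }
  }
}
```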



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6937) CopyOnWriteInsertHandler#consume will cause clustering performance degradation

2023-10-12 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-6937:
-
Fix Version/s: 1.0.0
   0.14.1

> CopyOnWriteInsertHandler#consume will cause clustering performance degradation
> --
>
> Key: HUDI-6937
> URL: https://issues.apache.org/jira/browse/HUDI-6937
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark
>Reporter: kwang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0, 0.14.1
>
> Attachments: hudi-0.12-flamegraph.png, hudi-0.12-log.png, 
> hudi-0.14-flamegraph.png, hudi-0.14-log.png
>
>
> We upgraded Hudi from 0.12 to 0.14 and found that the offline clustering 
> performance dropped by half. We compared and analyzed the flame graphs of 
> these two versions and found that TypedProperties object instantiation takes 
> too much time, which causes the clustering performance degradation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[hudi] branch master updated: [HUDI-6937] CopyOnWriteInsertHandler#consume cause clustering performance degradation (#9851)

2023-10-12 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new af2f74ba50c [HUDI-6937] CopyOnWriteInsertHandler#consume cause 
clustering performance degradation (#9851)
af2f74ba50c is described below

commit af2f74ba50c250b1d830f800d960c01bf66f646a
Author: ksmou <135721692+ks...@users.noreply.github.com>
AuthorDate: Fri Oct 13 07:50:10 2023 +0800

[HUDI-6937] CopyOnWriteInsertHandler#consume cause clustering performance 
degradation (#9851)
---
 .../java/org/apache/hudi/execution/CopyOnWriteInsertHandler.java   | 3 +--
 .../java/org/apache/hudi/execution/HoodieLazyInsertIterable.java   | 7 ++-
 .../main/java/org/apache/hudi/execution/ExplicitWriteHandler.java  | 3 +--
 .../src/main/java/org/apache/hudi/common/config/HoodieConfig.java  | 2 +-
 .../apache/hudi/hadoop/realtime/AbstractRealtimeRecordReader.java  | 4 ++--
 5 files changed, 7 insertions(+), 12 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/execution/CopyOnWriteInsertHandler.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/execution/CopyOnWriteInsertHandler.java
index 55db97e87a4..fd932a66a0a 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/execution/CopyOnWriteInsertHandler.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/execution/CopyOnWriteInsertHandler.java
@@ -19,7 +19,6 @@
 package org.apache.hudi.execution;
 
 import org.apache.hudi.client.WriteStatus;
-import org.apache.hudi.common.config.TypedProperties;
 import org.apache.hudi.common.engine.TaskContextSupplier;
 import org.apache.hudi.common.model.HoodieRecord;
 import org.apache.hudi.common.util.queue.HoodieConsumer;
@@ -95,7 +94,7 @@ public class CopyOnWriteInsertHandler
   record.getPartitionPath(), idPrefix, taskContextSupplier);
   handles.put(partitionPath, handle);
 }
-handle.write(record, genResult.schema, new 
TypedProperties(genResult.props));
+handle.write(record, genResult.schema, config.getProps());
   }
 
   @Override
diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/execution/HoodieLazyInsertIterable.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/execution/HoodieLazyInsertIterable.java
index e8bf3bb107f..84fea62604a 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/execution/HoodieLazyInsertIterable.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/execution/HoodieLazyInsertIterable.java
@@ -31,7 +31,6 @@ import org.apache.hudi.util.ExecutorFactory;
 
 import java.util.Iterator;
 import java.util.List;
-import java.util.Properties;
 import java.util.function.Function;
 
 /**
@@ -77,12 +76,10 @@ public abstract class HoodieLazyInsertIterable
   public static class HoodieInsertValueGenResult {
 private final R record;
 public final Schema schema;
-public final Properties props;
 
-public HoodieInsertValueGenResult(R record, Schema schema, Properties 
properties) {
+public HoodieInsertValueGenResult(R record, Schema schema) {
   this.record = record;
   this.schema = schema;
-  this.props = properties;
 }
 
 public R getResult() {
@@ -112,7 +109,7 @@ public abstract class HoodieLazyInsertIterable
 
 return record -> {
   HoodieRecord clonedRecord = shouldClone ? record.copy() : record;
-  return new HoodieInsertValueGenResult(clonedRecord, schema, 
writeConfig.getProps());
+  return new HoodieInsertValueGenResult(clonedRecord, schema);
 };
   }
 
diff --git 
a/hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/execution/ExplicitWriteHandler.java
 
b/hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/execution/ExplicitWriteHandler.java
index 187efd8fc81..59e1e3c6de4 100644
--- 
a/hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/execution/ExplicitWriteHandler.java
+++ 
b/hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/execution/ExplicitWriteHandler.java
@@ -19,7 +19,6 @@
 package org.apache.hudi.execution;
 
 import org.apache.hudi.client.WriteStatus;
-import org.apache.hudi.common.config.TypedProperties;
 import org.apache.hudi.common.model.HoodieRecord;
 import org.apache.hudi.common.util.queue.HoodieConsumer;
 import org.apache.hudi.io.HoodieWriteHandle;
@@ -46,7 +45,7 @@ public class ExplicitWriteHandler
   @Override
   public void 
consume(HoodieLazyInsertIterable.HoodieInsertValueGenResult 
genResult) {
 final HoodieRecord insertPayload = genResult.getResult();
-handle.write(insertPayload, genResult.schema, new 
TypedProperties(genResult.props));
+handle.write(insertPayload, genResult.schema, 
this.handle.getConfig().getProps());
   }
 
   @Override
diff --git 
a/hudi-common/src/main/java/org/apache/hudi/common/config/HoodieConfig.java 

Re: [PR] [HUDI-6937]CopyOnWriteInsertHandler#consume cause clustering performance degradation [hudi]

2023-10-12 Thread via GitHub


danny0405 merged PR #9851:
URL: https://github.com/apache/hudi/pull/9851


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]

2023-10-12 Thread via GitHub


hudi-bot commented on PR #9819:
URL: https://github.com/apache/hudi/pull/9819#issuecomment-1760530357

   
   ## CI report:
   
   * a4985db0ce22fb4b4f2518ed70bd96890024a08b UNKNOWN
   * df93e1373c94d9988ac002dd2155eba0008031d3 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20312)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]

2023-10-12 Thread via GitHub


linliu-code commented on code in PR #9819:
URL: https://github.com/apache/hudi/pull/9819#discussion_r1357490860


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/FileGroupReaderBasedParquetFileFormat.scala:
##
@@ -0,0 +1,258 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet
+
+import kotlin.NotImplementedError
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+import org.apache.hudi.MergeOnReadSnapshotRelation.createPartitionedFile
+import org.apache.hudi.common.config.TypedProperties
+import org.apache.hudi.common.engine.HoodieReaderContext
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.model.{FileSlice, HoodieBaseFile, HoodieLogFile, 
HoodieRecord}
+import org.apache.hudi.common.table.read.HoodieFileGroupReader
+import org.apache.hudi.common.util.{Option => HOption}
+import org.apache.hudi.{
+  HoodieBaseRelation, HoodieSparkUtils, HoodieTableSchema, HoodieTableState, 
MergeOnReadSnapshotRelation,
+  PartitionFileSliceMapping, SparkAdapterSupport, 
SparkFileFormatInternalRowReaderContext}
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.execution.datasources.PartitionedFile
+import org.apache.spark.sql.hudi.HoodieSqlCommonUtils.isMetaField
+import org.apache.spark.sql.sources.Filter
+import org.apache.spark.sql.types.{StructField, StructType}
+import org.apache.spark.util.SerializableConfiguration
+
+import java.util
+import scala.collection.JavaConverters._
+import scala.collection.mutable
+import scala.jdk.CollectionConverters.asScalaIteratorConverter
+
+class FileGroupReaderBasedParquetFileFormat(tableState: 
Broadcast[HoodieTableState],
+tableSchema: 
Broadcast[HoodieTableSchema],
+tableName: String,
+mergeType: String,
+mandatoryFields: Seq[String],
+isMOR: Boolean,
+isBootstrap: Boolean,
+shouldUseRecordPosition: Boolean
+   ) extends ParquetFileFormat with 
SparkAdapterSupport {
+  var isProjected = false
+
+  /**
+   * Support batch needs to remain consistent, even if one side of a bootstrap 
merge can support
+   * while the other side can't
+   */
+  private var supportBatchCalled = false
+  private var supportBatchResult = false
+
+  override def supportBatch(sparkSession: SparkSession, schema: StructType): 
Boolean = {
+if (!supportBatchCalled) {
+  supportBatchCalled = true
+  supportBatchResult = !isMOR && super.supportBatch(sparkSession, schema)
+}
+supportBatchResult
+  }
+
+  override def isSplitable(
+sparkSession: SparkSession,
+options: Map[String, String],
+path: Path): Boolean = {
+false
+  }
+
+  override def buildReaderWithPartitionValues(sparkSession: SparkSession,
+  dataSchema: StructType,
+  partitionSchema: StructType,
+  requiredSchema: StructType,
+  filters: Seq[Filter],
+  options: Map[String, String],
+  hadoopConf: Configuration): 
PartitionedFile => Iterator[InternalRow] = {
+val requiredSchemaWithMandatory = 
generateRequiredSchemaWithMandatory(requiredSchema, dataSchema)
+val requiredSchemaSplits = requiredSchemaWithMandatory.fields.partition(f 
=> HoodieRecord.HOODIE_META_COLUMNS_WITH_OPERATION.contains(f.name))
+val requiredMeta = StructType(requiredSchemaSplits._1)
+val requiredWithoutMeta = StructType(requiredSchemaSplits._2)
+val (baseFileReader, preMergeBaseFileReader, _, 

Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]

2023-10-12 Thread via GitHub


linliu-code commented on code in PR #9819:
URL: https://github.com/apache/hudi/pull/9819#discussion_r1357489742


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/FileGroupReaderBasedParquetFileFormat.scala:
##
@@ -0,0 +1,258 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet
+
+import kotlin.NotImplementedError
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+import org.apache.hudi.MergeOnReadSnapshotRelation.createPartitionedFile
+import org.apache.hudi.common.config.TypedProperties
+import org.apache.hudi.common.engine.HoodieReaderContext
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.model.{FileSlice, HoodieBaseFile, HoodieLogFile, 
HoodieRecord}
+import org.apache.hudi.common.table.read.HoodieFileGroupReader
+import org.apache.hudi.common.util.{Option => HOption}
+import org.apache.hudi.{
+  HoodieBaseRelation, HoodieSparkUtils, HoodieTableSchema, HoodieTableState, 
MergeOnReadSnapshotRelation,
+  PartitionFileSliceMapping, SparkAdapterSupport, 
SparkFileFormatInternalRowReaderContext}
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.execution.datasources.PartitionedFile
+import org.apache.spark.sql.hudi.HoodieSqlCommonUtils.isMetaField
+import org.apache.spark.sql.sources.Filter
+import org.apache.spark.sql.types.{StructField, StructType}
+import org.apache.spark.util.SerializableConfiguration
+
+import java.util
+import scala.collection.JavaConverters._
+import scala.collection.mutable
+import scala.jdk.CollectionConverters.asScalaIteratorConverter
+
+class FileGroupReaderBasedParquetFileFormat(tableState: 
Broadcast[HoodieTableState],

Review Comment:
   Yeah, adding the "Hoodie" prefix would be better.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]

2023-10-12 Thread via GitHub


linliu-code commented on code in PR #9819:
URL: https://github.com/apache/hudi/pull/9819#discussion_r1357485588


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieKeyBasedFileGroupRecordBuffer.java:
##
@@ -0,0 +1,250 @@
+/*
+ *
+ *  * Licensed to the Apache Software Foundation (ASF) under one or more
+ *  * contributor license agreements.  See the NOTICE file distributed with
+ *  * this work for additional information regarding copyright ownership.
+ *  * The ASF licenses this file to You under the Apache License, Version 2.0
+ *  * (the "License"); you may not use this file except in compliance with
+ *  * the License.  You may obtain a copy of the License at
+ *  *
+ *  *http://www.apache.org/licenses/LICENSE-2.0
+ *  *
+ *  * Unless required by applicable law or agreed to in writing, software
+ *  * distributed under the License is distributed on an "AS IS" BASIS,
+ *  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ *  * See the License for the specific language governing permissions and
+ *  * limitations under the License.
+ *
+ */
+
+package org.apache.hudi.common.table.read;
+
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.model.DeleteRecord;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordMerger;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.log.KeySpec;
+import org.apache.hudi.common.engine.HoodieReaderContext;
+import org.apache.hudi.common.table.log.block.HoodieDataBlock;
+import org.apache.hudi.common.table.log.block.HoodieDeleteBlock;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ReflectionUtils;
+import org.apache.hudi.common.util.collection.ClosableIterator;
+import org.apache.hudi.common.util.collection.Pair;
+
+import org.apache.avro.Schema;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.Iterator;
+import java.util.Map;
+
+import static org.apache.hudi.common.util.ValidationUtils.checkState;
+
+public class HoodieKeyBasedFileGroupRecordBuffer implements 
HoodieFileGroupRecordBuffer {

Review Comment:
   Done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]

2023-10-12 Thread via GitHub


hudi-bot commented on PR #9819:
URL: https://github.com/apache/hudi/pull/9819#issuecomment-1760406528

   
   ## CI report:
   
   * a4985db0ce22fb4b4f2518ed70bd96890024a08b UNKNOWN
   * 3402439aad67a67eaf72fe0dff27cfaf8d1bb80e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20299)
 
   * df93e1373c94d9988ac002dd2155eba0008031d3 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20312)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]

2023-10-12 Thread via GitHub


linliu-code commented on code in PR #9819:
URL: https://github.com/apache/hudi/pull/9819#discussion_r1357453729


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieFileGroupReader.java:
##
@@ -102,6 +102,18 @@ public HoodieFileGroupReader(HoodieReaderContext 
readerContext,
 this.start = 0;
 this.length = Long.MAX_VALUE;
 this.baseFileIterator = new EmptyIterator<>();
+this.shouldUseRecordPosition = false;

Review Comment:
   Removed the shouldUseRecordPosition variable.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]

2023-10-12 Thread via GitHub


linliu-code commented on code in PR #9819:
URL: https://github.com/apache/hudi/pull/9819#discussion_r1357451414


##
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieMergedLogRecordReader.java:
##
@@ -81,11 +80,16 @@ private HoodieMergedLogRecordReader(HoodieReaderContext 
readerContext,
   Option partitionName,
   InternalSchema internalSchema,
   Option keyFieldOverride,
-  boolean enableOptimizedLogBlocksScan, 
HoodieRecordMerger recordMerger) {
+  boolean enableOptimizedLogBlocksScan,
+  HoodieRecordMerger recordMerger,
+  HoodieFileGroupRecordBuffer 
recordBuffer) {
 super(readerContext, fs, basePath, logFilePaths, readerSchema, 
latestInstantTime, readBlocksLazily, reverseReader, bufferSize,
-instantRange, withOperationField, forceFullScan, partitionName, 
internalSchema, keyFieldOverride, enableOptimizedLogBlocksScan, recordMerger);
+instantRange, withOperationField, forceFullScan, partitionName, 
internalSchema, keyFieldOverride, enableOptimizedLogBlocksScan,
+recordMerger, recordBuffer);
 this.records = new HashMap<>();
 this.scannedPrefixes = new HashSet<>();
+this.recordBuffer = recordBuffer;

Review Comment:
   Removed it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]

2023-10-12 Thread via GitHub


hudi-bot commented on PR #9743:
URL: https://github.com/apache/hudi/pull/9743#issuecomment-1760396425

   
   ## CI report:
   
   * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN
   * 8b312ef45b84d8ed08dde719e60a01e4b7c44cb6 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20311)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]

2023-10-12 Thread via GitHub


hudi-bot commented on PR #9819:
URL: https://github.com/apache/hudi/pull/9819#issuecomment-1760396627

   
   ## CI report:
   
   * a4985db0ce22fb4b4f2518ed70bd96890024a08b UNKNOWN
   * 3402439aad67a67eaf72fe0dff27cfaf8d1bb80e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20299)
 
   * df93e1373c94d9988ac002dd2155eba0008031d3 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (HUDI-6938) Run TPC-DS benchmark on the integration

2023-10-12 Thread Lin Liu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17774697#comment-17774697
 ] 

Lin Liu commented on HUDI-6938:
---

Trying to create a testing cluster for the benchmark.

> Run TPC-DS benchmark on the integration
> ---
>
> Key: HUDI-6938
> URL: https://issues.apache.org/jira/browse/HUDI-6938
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Lin Liu
>Assignee: Lin Liu
>Priority: Major
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]

2023-10-12 Thread via GitHub


hudi-bot commented on PR #9743:
URL: https://github.com/apache/hudi/pull/9743#issuecomment-1760339498

   
   ## CI report:
   
   * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN
   * f544ec59661637f87d06a6c88eb5130493608e70 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20308)
 
   * 8b312ef45b84d8ed08dde719e60a01e4b7c44cb6 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20311)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]

2023-10-12 Thread via GitHub


hudi-bot commented on PR #9743:
URL: https://github.com/apache/hudi/pull/9743#issuecomment-1760328788

   
   ## CI report:
   
   * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN
   * f544ec59661637f87d06a6c88eb5130493608e70 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20308)
 
   * 8b312ef45b84d8ed08dde719e60a01e4b7c44cb6 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6937]CopyOnWriteInsertHandler#consume cause clustering performance degradation [hudi]

2023-10-12 Thread via GitHub


hudi-bot commented on PR #9851:
URL: https://github.com/apache/hudi/pull/9851#issuecomment-1760329103

   
   ## CI report:
   
   * 3ef4ff72aa23c4eceac09e3d09743ff62b6e2cab Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20309)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]

2023-10-12 Thread via GitHub


jonvex commented on code in PR #9743:
URL: https://github.com/apache/hudi/pull/9743#discussion_r1357386941


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java:
##
@@ -661,6 +652,35 @@ private Pair>> fetchFromSourc
 return Pair.of(schemaProvider, Pair.of(checkpointStr, records));
   }
 
+  /**
+   * Apply schema reconcile and schema evolution rules(schema on read) and 
generate new target schema provider.
+   *
+   * @param incomingSchema schema of the source data
+   * @param sourceSchemaProvider Source schema provider.
+   * @return the SchemaProvider that can be used as writer schema.
+   */
+  private SchemaProvider getDeducedSchemaProvider(Schema incomingSchema, 
SchemaProvider sourceSchemaProvider) {

Review Comment:
   Yeah, I am confused about this



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-6938) Run TPC-DS benchmark on the integration

2023-10-12 Thread Lin Liu (Jira)
Lin Liu created HUDI-6938:
-

 Summary: Run TPC-DS benchmark on the integration
 Key: HUDI-6938
 URL: https://issues.apache.org/jira/browse/HUDI-6938
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: Lin Liu
Assignee: Lin Liu
 Fix For: 1.0.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6938) Run TPC-DS benchmark on the integration

2023-10-12 Thread Lin Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lin Liu updated HUDI-6938:
--
Status: In Progress  (was: Open)

> Run TPC-DS benchmark on the integration
> ---
>
> Key: HUDI-6938
> URL: https://issues.apache.org/jira/browse/HUDI-6938
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Lin Liu
>Assignee: Lin Liu
>Priority: Major
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]

2023-10-12 Thread via GitHub


jonvex commented on code in PR #9743:
URL: https://github.com/apache/hudi/pull/9743#discussion_r1357354486


##
hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieAvroParquetReader.java:
##
@@ -165,7 +165,10 @@ private ClosableIterator 
getIndexedRecordIteratorInternal(Schema
   AvroReadSupport.setAvroReadSchema(conf, requestedSchema.get());
   AvroReadSupport.setRequestedProjection(conf, requestedSchema.get());
 }
-ParquetReader reader = new 
HoodieAvroParquetReaderBuilder(path).withConf(conf).build();
+ParquetReader reader = new 
HoodieAvroParquetReaderBuilder(path).withConf(conf)
+.set(ParquetInputFormat.STRICT_TYPE_CHECKING, "false")
+.set(AvroSchemaConverter.ADD_LIST_ELEMENT_RECORDS, "false")

Review Comment:
   tryOverrideDefaultConfigs pretty much hardcodes this. But 
tryOverrideDefaultConfigs is wrong: it is not setting the right config. I 
stepped through the code with the debugger, and this is how you set it 
correctly. This is no longer relevant for this PR since I am focused on 
enabling the row writer, so maybe I should revert this and file a Jira?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]

2023-10-12 Thread via GitHub


jonvex commented on code in PR #9743:
URL: https://github.com/apache/hudi/pull/9743#discussion_r1357341403


##
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieAvroDataBlock.java:
##
@@ -153,11 +156,14 @@ protected  ClosableIterator 
deserializeRecords(HoodieReaderContext read
   }
 
   private static class RecordIterator implements 
ClosableIterator {
+
+private final Boolean whichImplementation = false;
 private byte[] content;
 private final SizeAwareDataInputStream dis;
 private final GenericDatumReader reader;
 private final ThreadLocal decoderCache = new 
ThreadLocal<>();
-
+private Option castSchema = Option.empty();

Review Comment:
   Renamed to `promotedSchema` and updated some other method names to make it 
clear that this is for type promotion and not schema evolution as a whole.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]

2023-10-12 Thread via GitHub


jonvex commented on code in PR #9743:
URL: https://github.com/apache/hudi/pull/9743#discussion_r1357335183


##
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieAvroDataBlock.java:
##
@@ -153,11 +156,14 @@ protected  ClosableIterator 
deserializeRecords(HoodieReaderContext read
   }
 
   private static class RecordIterator implements 
ClosableIterator {
+
+private final Boolean whichImplementation = false;
 private byte[] content;
 private final SizeAwareDataInputStream dis;
 private final GenericDatumReader reader;
 private final ThreadLocal decoderCache = new 
ThreadLocal<>();
-
+private Option castSchema = Option.empty();
+private UnaryOperator converter;

Review Comment:
   [Avro schema type 
promotion](https://avro.apache.org/docs/1.11.1/specification/) (just ctrl-f for 
promotion in the link) does not accommodate promotion from number types to 
string. This is a common ask, so we are adding the feature. 
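
   A minimal sketch of the kind of per-field converter such a feature implies, 
with hypothetical names; plain Avro schema resolution rejects this promotion, 
so the value has to be rewritten explicitly:

```java
import java.util.function.UnaryOperator;

import org.apache.avro.util.Utf8;

// Hypothetical converter: promotes an int/long field value to a string
// when the reader schema declares STRING but the writer wrote a number.
final class NumberToStringConverter {

  private NumberToStringConverter() {
  }

  static UnaryOperator<Object> create() {
    return value -> {
      if (value instanceof Integer || value instanceof Long) {
        // Avro materializes strings as Utf8 at runtime.
        return new Utf8(String.valueOf(value));
      }
      return value;
    };
  }
}
```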



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]

2023-10-12 Thread via GitHub


jonvex commented on code in PR #9743:
URL: https://github.com/apache/hudi/pull/9743#discussion_r1357329648


##
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieAvroDataBlock.java:
##
@@ -205,6 +224,15 @@ public IndexedRecord next() {
 IndexedRecord record = this.reader.read(null, decoder);
 this.dis.skipBytes(recordLength);
 this.readRecords++;
+if (whichImplementation) {

Review Comment:
   Removed whichImplementation; in the case where castSchema is not present, 
we just return the record.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]

2023-10-12 Thread via GitHub


jonvex commented on code in PR #9743:
URL: https://github.com/apache/hudi/pull/9743#discussion_r1357319904


##
hudi-common/src/main/java/org/apache/hudi/avro/AvroSchemaCompatibility.java:
##
@@ -372,7 +372,8 @@ private SchemaCompatibilityResult 
calculateCompatibility(final Schema reader, fi
 return (writer.getType() == Type.STRING) ? result : 
result.mergedWith(typeMismatch(reader, writer, locations));
   }
   case STRING: {
-return (writer.getType() == Type.BYTES) ? result : 
result.mergedWith(typeMismatch(reader, writer, locations));
+return ((writer.getType() == Type.BYTES) || (writer.getType() == 
Type.INT) || (writer.getType() == Type.LONG)

Review Comment:
   does it?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]

2023-10-12 Thread via GitHub


jonvex commented on code in PR #9743:
URL: https://github.com/apache/hudi/pull/9743#discussion_r1357316577


##
hudi-common/src/main/java/org/apache/hudi/avro/AvroCastingGenericRecord.java:
##
@@ -0,0 +1,259 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.avro;
+
+import org.apache.hudi.common.util.ValidationUtils;
+
+import org.apache.avro.AvroRuntimeException;
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericData;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.avro.generic.IndexedRecord;
+
+import java.nio.ByteBuffer;
+import java.util.ArrayList;
+import java.util.Collection;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.function.UnaryOperator;
+
+import static org.apache.avro.Schema.Type.ARRAY;
+import static org.apache.avro.Schema.Type.MAP;
+import static org.apache.avro.Schema.Type.STRING;
+import static org.apache.avro.Schema.Type.UNION;
+import static org.apache.hudi.common.util.StringUtils.getUTF8Bytes;
+
+
+/**
+ * Implementation of avro generic record that uses casting for implicit schema 
evolution
+ */
+public class AvroCastingGenericRecord implements GenericRecord {
+  private final IndexedRecord record;
+  private final Schema readerSchema;
+  private final Map> fieldConverters;
+
+  private AvroCastingGenericRecord(IndexedRecord record, Schema readerSchema, 
Map> fieldConverters) {
+this.record = record;
+this.readerSchema = readerSchema;
+this.fieldConverters = fieldConverters;
+  }
+
+  @Override
+  public void put(int i, Object v) {
+throw new UnsupportedOperationException();
+  }
+
+  @Override
+  public Object get(int i) {
+if (fieldConverters.containsKey(i)) {
+  return fieldConverters.get(i).apply(record.get(i));
+}
+return record.get(i);
+  }
+
+  @Override
+  public Schema getSchema() {
+return readerSchema;
+  }
+
+  @Override
+  public void put(String key, Object v) {
+Schema.Field field = getSchema().getField(key);
+if (field == null) {
+  throw new AvroRuntimeException("Not a valid schema field: " + key);
+}
+put(field.pos(), v);
+  }
+
+  @Override
+  public Object get(String key) {
+Schema.Field field = getSchema().getField(key);
+if (field == null) {
+  throw new AvroRuntimeException("Not a valid schema field: " + key);
+}
+return get(field.pos());
+  }
+
+
+  /**
+   * Avro schema evolution does not support promotion from numbers to string. 
This function returns true if
+   * it will be necessary to rewrite the record to support the evolution.
+   * NOTE: this does not determine whether the schema evolution is valid. It 
is just trying to find if the schema evolves from
+   * a number to string, as quick as possible.
+   */
+  public static Boolean 
recordNeedsRewriteForExtendedAvroSchemaEvolution(Schema writerSchema, Schema 
readerSchema) {

Review Comment:
   The deltastreamer schema evolution tests contain end-to-end testing.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]

2023-10-12 Thread via GitHub


jonvex commented on code in PR #9743:
URL: https://github.com/apache/hudi/pull/9743#discussion_r1357315992


##
hudi-common/src/main/java/org/apache/hudi/avro/AvroCastingGenericRecord.java:
##
@@ -0,0 +1,259 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.avro;
+
+import org.apache.hudi.common.util.ValidationUtils;
+
+import org.apache.avro.AvroRuntimeException;
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericData;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.avro.generic.IndexedRecord;
+
+import java.nio.ByteBuffer;
+import java.util.ArrayList;
+import java.util.Collection;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.function.UnaryOperator;
+
+import static org.apache.avro.Schema.Type.ARRAY;
+import static org.apache.avro.Schema.Type.MAP;
+import static org.apache.avro.Schema.Type.STRING;
+import static org.apache.avro.Schema.Type.UNION;
+import static org.apache.hudi.common.util.StringUtils.getUTF8Bytes;
+
+
+/**
+ * Implementation of Avro generic record that uses casting for implicit schema evolution.
+ */
+public class AvroCastingGenericRecord implements GenericRecord {
+  private final IndexedRecord record;
+  private final Schema readerSchema;
+  private final Map<Integer, UnaryOperator<Object>> fieldConverters;
+
+  private AvroCastingGenericRecord(IndexedRecord record, Schema readerSchema, Map<Integer, UnaryOperator<Object>> fieldConverters) {
+this.record = record;
+this.readerSchema = readerSchema;
+this.fieldConverters = fieldConverters;
+  }
+
+  @Override
+  public void put(int i, Object v) {
+throw new UnsupportedOperationException();
+  }
+
+  @Override
+  public Object get(int i) {
+if (fieldConverters.containsKey(i)) {
+  return fieldConverters.get(i).apply(record.get(i));
+}
+return record.get(i);
+  }
+
+  @Override
+  public Schema getSchema() {
+return readerSchema;
+  }
+
+  @Override
+  public void put(String key, Object v) {
+Schema.Field field = getSchema().getField(key);
+if (field == null) {
+  throw new AvroRuntimeException("Not a valid schema field: " + key);
+}
+put(field.pos(), v);
+  }
+
+  @Override
+  public Object get(String key) {
+Schema.Field field = getSchema().getField(key);
+if (field == null) {
+  throw new AvroRuntimeException("Not a valid schema field: " + key);
+}
+return get(field.pos());
+  }
+

Review Comment:
   removed this class



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]

2023-10-12 Thread via GitHub


jonvex commented on code in PR #9743:
URL: https://github.com/apache/hudi/pull/9743#discussion_r1357315793


##
hudi-common/src/main/java/org/apache/hudi/avro/AvroCastingGenericRecord.java:
##
@@ -0,0 +1,259 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.avro;
+
+import org.apache.hudi.common.util.ValidationUtils;
+
+import org.apache.avro.AvroRuntimeException;
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericData;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.avro.generic.IndexedRecord;
+
+import java.nio.ByteBuffer;
+import java.util.ArrayList;
+import java.util.Collection;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.function.UnaryOperator;
+
+import static org.apache.avro.Schema.Type.ARRAY;
+import static org.apache.avro.Schema.Type.MAP;
+import static org.apache.avro.Schema.Type.STRING;
+import static org.apache.avro.Schema.Type.UNION;
+import static org.apache.hudi.common.util.StringUtils.getUTF8Bytes;
+
+
+/**
+ * Implementation of Avro generic record that uses casting for implicit schema evolution.
+ */

Review Comment:
   removed this class



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Row writer optimization for bulk insert [hudi]

2023-10-12 Thread via GitHub


hudi-bot commented on PR #9852:
URL: https://github.com/apache/hudi/pull/9852#issuecomment-1760193721

   
   ## CI report:
   
   * 09d2e02d34a1fb02596925aa6a0b1a04ec959d8e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20310)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]

2023-10-12 Thread via GitHub


jonvex commented on code in PR #9743:
URL: https://github.com/apache/hudi/pull/9743#discussion_r1357244776


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/common/table/log/HoodieFileSliceReader.java:
##
@@ -19,47 +19,80 @@
 
 package org.apache.hudi.common.table.log;
 
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.model.HoodiePayloadProps;
 import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordMerger;
 import org.apache.hudi.common.util.Option;
 import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieClusteringException;
 import org.apache.hudi.io.storage.HoodieFileReader;
 
 import org.apache.avro.Schema;
 
 import java.io.IOException;
 import java.util.Iterator;
+import java.util.Map;
 import java.util.Properties;
 
-/**
- * Reads records from base file and merges any updates from log files and provides iterable over all records in the file slice.
- */
-public class HoodieFileSliceReader<T> implements Iterator<HoodieRecord<T>> {
+public class HoodieFileSliceReader extends LogFileIterator {
+  private Option<Iterator<HoodieRecord>> baseFileIterator;
+  private HoodieMergedLogRecordScanner scanner;
+  private Schema schema;
+  private Properties props;
 
-  private final Iterator<HoodieRecord<T>> recordsIterator;
+  private TypedProperties payloadProps = new TypedProperties();
+  private Option<Pair<String, String>> simpleKeyGenFieldsOpt;
+  Map<String, HoodieRecord> records;
+  HoodieRecordMerger merger;
 
-  public static HoodieFileSliceReader getFileSliceReader(
-  Option<HoodieFileReader> baseFileReader, HoodieMergedLogRecordScanner scanner, Schema schema, Properties props, Option<Pair<String, String>> simpleKeyGenFieldsOpt) throws IOException {
+  public HoodieFileSliceReader(Option<HoodieFileReader> baseFileReader,
+   HoodieMergedLogRecordScanner scanner, Schema schema, String preCombineField, HoodieRecordMerger merger,
+   Properties props, Option<Pair<String, String>> simpleKeyGenFieldsOpt) throws IOException {
+super(scanner);
 if (baseFileReader.isPresent()) {
-  Iterator<HoodieRecord> baseIterator = baseFileReader.get().getRecordIterator(schema);
-  while (baseIterator.hasNext()) {
-scanner.processNextRecord(baseIterator.next().wrapIntoHoodieRecordPayloadWithParams(schema, props,
-simpleKeyGenFieldsOpt, scanner.isWithOperationField(), scanner.getPartitionNameOverride(), false, Option.empty()));
-  }
+  this.baseFileIterator = Option.of(baseFileReader.get().getRecordIterator(schema));
+} else {
+  this.baseFileIterator = Option.empty();
 }
-return new HoodieFileSliceReader(scanner.iterator());
+this.scanner = scanner;
+this.schema = schema;
+this.merger = merger;
+if (preCombineField != null) {
+  payloadProps.setProperty(HoodiePayloadProps.PAYLOAD_ORDERING_FIELD_PROP_KEY, preCombineField);
+}
+this.props = props;
+this.simpleKeyGenFieldsOpt = simpleKeyGenFieldsOpt;
+this.records = scanner.getRecords();
   }
 
-  private HoodieFileSliceReader(Iterator<HoodieRecord<T>> recordsItr) {
-this.recordsIterator = recordsItr;
+  private Boolean hasNextInternal() {

Review Comment:
   addressed in https://github.com/apache/hudi/pull/9774
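
   For context on the reader above: the merged log-record map acts as an overlay
   on the base-file iterator, with matched keys consumed as the base records are
   read so that the remaining log-only records can be emitted afterwards as
   inserts. A generic sketch of that merge shape, assuming a simple last-writer-wins
   merge function (none of these names are Hudi APIs):

import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

public class BaseLogMergeSketch {
  // Merge base records with an overlay of updates keyed the same way.
  static <K, V> List<V> merge(Iterator<Map.Entry<K, V>> base,
                              Map<K, V> logRecords,
                              BinaryOperator<V> mergeFn) {
    List<V> out = new ArrayList<>();
    while (base.hasNext()) {
      Map.Entry<K, V> entry = base.next();
      V update = logRecords.remove(entry.getKey()); // consume the matched update
      out.add(update == null ? entry.getValue()
                             : mergeFn.apply(entry.getValue(), update));
    }
    out.addAll(logRecords.values()); // unmatched log records are inserts
    return out;
  }

  public static void main(String[] args) {
    Map<String, String> baseRows = new LinkedHashMap<>();
    baseRows.put("k1", "v1-base");
    baseRows.put("k2", "v2-base");
    Map<String, String> logRows = new LinkedHashMap<>();
    logRows.put("k2", "v2-update");
    logRows.put("k3", "v3-insert");
    // Last-writer-wins: the log side of the pair replaces the base side.
    System.out.println(merge(baseRows.entrySet().iterator(), logRows,
        (baseVal, logVal) -> logVal)); // [v1-base, v2-update, v3-insert]
  }
}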



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]

2023-10-12 Thread via GitHub


jonvex commented on code in PR #9743:
URL: https://github.com/apache/hudi/pull/9743#discussion_r1357245092


##
hudi-common/src/main/java/org/apache/hudi/avro/AvroCastingGenericRecord.java:
##
@@ -0,0 +1,259 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.avro;
+
+import org.apache.hudi.common.util.ValidationUtils;
+
+import org.apache.avro.AvroRuntimeException;
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericData;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.avro.generic.IndexedRecord;
+
+import java.nio.ByteBuffer;
+import java.util.ArrayList;
+import java.util.Collection;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.function.UnaryOperator;
+
+import static org.apache.avro.Schema.Type.ARRAY;
+import static org.apache.avro.Schema.Type.MAP;
+import static org.apache.avro.Schema.Type.STRING;
+import static org.apache.avro.Schema.Type.UNION;
+import static org.apache.hudi.common.util.StringUtils.getUTF8Bytes;
+
+
+/**
+ * Implementation of Avro generic record that uses casting for implicit schema evolution.
+ */
+public class AvroCastingGenericRecord implements GenericRecord {
+  private final IndexedRecord record;
+  private final Schema readerSchema;
+  private final Map<Integer, UnaryOperator<Object>> fieldConverters;
+
+  private AvroCastingGenericRecord(IndexedRecord record, Schema readerSchema, Map<Integer, UnaryOperator<Object>> fieldConverters) {
+this.record = record;
+this.readerSchema = readerSchema;
+this.fieldConverters = fieldConverters;
+  }
+
+  @Override
+  public void put(int i, Object v) {
+throw new UnsupportedOperationException();
+  }
+
+  @Override
+  public Object get(int i) {
+if (fieldConverters.containsKey(i)) {
+  return fieldConverters.get(i).apply(record.get(i));
+}
+return record.get(i);
+  }
+
+  @Override
+  public Schema getSchema() {
+return readerSchema;
+  }
+
+  @Override
+  public void put(String key, Object v) {
+Schema.Field field = getSchema().getField(key);
+if (field == null) {
+  throw new AvroRuntimeException("Not a valid schema field: " + key);
+}
+put(field.pos(), v);
+  }
+
+  @Override
+  public Object get(String key) {
+Schema.Field field = getSchema().getField(key);
+if (field == null) {
+  throw new AvroRuntimeException("Not a valid schema field: " + key);
+}
+return get(field.pos());
+  }
+
+
+  /**
+   * Avro schema evolution does not support promotion from numbers to string. This function returns true if
+   * it will be necessary to rewrite the record to support the evolution.
+   * NOTE: this does not determine whether the schema evolution is valid. It is just trying to find, as
+   * quickly as possible, whether the schema evolves from a number to a string.
+   */
+  public static Boolean recordNeedsRewriteForExtendedAvroSchemaEvolution(Schema writerSchema, Schema readerSchema) {

Review Comment:
   addressed in https://github.com/apache/hudi/pull/9774



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]

2023-10-12 Thread via GitHub


jonvex commented on code in PR #9743:
URL: https://github.com/apache/hudi/pull/9743#discussion_r1357244322


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/common/table/log/CachingIterator.java:
##
@@ -0,0 +1,41 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.log;
+
+import java.util.Iterator;
+
+public abstract class CachingIterator<T> implements Iterator<T> {
+
+  protected T nextRecord;
+
+  protected abstract Boolean doHasNext();

Review Comment:
   Addressed in https://github.com/apache/hudi/pull/9774
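
   For context on the diff above: CachingIterator is the common look-ahead
   iterator pattern, where doHasNext() computes the next element and caches it in
   nextRecord. A self-contained sketch of how a subclass plugs in follows; the
   hasNext()/next() wiring is an assumption inferred from the cached field, and
   the even-number filter is purely illustrative.

import java.util.Arrays;
import java.util.Iterator;

abstract class CachingIteratorSketch<T> implements Iterator<T> {
  protected T nextRecord;

  // Subclasses compute the next element and cache it in nextRecord.
  protected abstract boolean doHasNext();

  @Override // assumed wiring: reuse the cached element if one is pending
  public boolean hasNext() {
    return nextRecord != null || doHasNext();
  }

  @Override
  public T next() {
    T result = nextRecord;
    nextRecord = null; // hand the cached element out exactly once
    return result;
  }
}

public class EvenFilterIterator extends CachingIteratorSketch<Integer> {
  private final Iterator<Integer> inner;

  public EvenFilterIterator(Iterator<Integer> inner) {
    this.inner = inner;
  }

  @Override
  protected boolean doHasNext() {
    while (inner.hasNext()) {
      Integer candidate = inner.next();
      if (candidate % 2 == 0) {
        nextRecord = candidate; // cache the match for the next() call
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    EvenFilterIterator it =
        new EvenFilterIterator(Arrays.asList(1, 2, 3, 4, 5).iterator());
    while (it.hasNext()) {
      System.out.println(it.next()); // prints 2 then 4
    }
  }
}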



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6872) Simplify Out Of Box Schema Evolution Functionality

2023-10-12 Thread Jonathan Vexler (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Vexler updated HUDI-6872:
--
Component/s: spark
 spark-sql
 (was: tests-ci)
Description: Test schema evolution capabilities out of the box for 
deltastreamer and datasource. Make schema evolution out of the box easy to 
understand and use  (was: Test schema evolution capabilities out of the box for 
deltastreamer)
 Issue Type: Improvement  (was: Test)
Summary: Simplify Out Of Box Schema Evolution Functionality  (was: Test 
OOB Schema Evolution for deltastreamer)

> Simplify Out Of Box Schema Evolution Functionality
> --
>
> Key: HUDI-6872
> URL: https://issues.apache.org/jira/browse/HUDI-6872
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: deltastreamer, spark, spark-sql
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
>
> Test schema evolution capabilities out of the box for deltastreamer and 
> datasource. Make schema evolution out of the box easy to understand and use



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] Row writer optimization for bulk insert [hudi]

2023-10-12 Thread via GitHub


hudi-bot commented on PR #9852:
URL: https://github.com/apache/hudi/pull/9852#issuecomment-1760085500

   
   ## CI report:
   
   * 09d2e02d34a1fb02596925aa6a0b1a04ec959d8e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20310)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Row writer optimization for bulk insert [hudi]

2023-10-12 Thread via GitHub


hudi-bot commented on PR #9852:
URL: https://github.com/apache/hudi/pull/9852#issuecomment-1760063106

   
   ## CI report:
   
   * 09d2e02d34a1fb02596925aa6a0b1a04ec959d8e UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6937]CopyOnWriteInsertHandler#consume cause clustering performance degradation [hudi]

2023-10-12 Thread via GitHub


hudi-bot commented on PR #9851:
URL: https://github.com/apache/hudi/pull/9851#issuecomment-1760062987

   
   ## CI report:
   
   * 3ef4ff72aa23c4eceac09e3d09743ff62b6e2cab Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20309)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6937]CopyOnWriteInsertHandler#consume cause clustering performance degradation [hudi]

2023-10-12 Thread via GitHub


hudi-bot commented on PR #9851:
URL: https://github.com/apache/hudi/pull/9851#issuecomment-1760049140

   
   ## CI report:
   
   * 3ef4ff72aa23c4eceac09e3d09743ff62b6e2cab UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6872] Test out of box schema evolution for deltastreamer [hudi]

2023-10-12 Thread via GitHub


hudi-bot commented on PR #9743:
URL: https://github.com/apache/hudi/pull/9743#issuecomment-1760048702

   
   ## CI report:
   
   * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN
   * f544ec59661637f87d06a6c88eb5130493608e70 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20308)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] Row writer optimization for bulk insert [hudi]

2023-10-12 Thread via GitHub


lokesh-lingarajan-0310 opened a new pull request, #9852:
URL: https://github.com/apache/hudi/pull/9852

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6872] Test out of box schema evolution for deltastreamer [hudi]

2023-10-12 Thread via GitHub


hudi-bot commented on PR #9743:
URL: https://github.com/apache/hudi/pull/9743#issuecomment-1759983338

   
   ## CI report:
   
   * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN
   * 908ae8d3f54adcd9ba24205c9d66add546079c58 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20273)
 
   * f544ec59661637f87d06a6c88eb5130493608e70 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20308)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]

2023-10-12 Thread via GitHub


codope commented on code in PR #9819:
URL: https://github.com/apache/hudi/pull/9819#discussion_r1357025981


##
hudi-common/src/main/java/org/apache/hudi/common/table/log/KeySpec.java:
##
@@ -0,0 +1,36 @@
+/*
+ *
+ *  * Licensed to the Apache Software Foundation (ASF) under one or more
+ *  * contributor license agreements.  See the NOTICE file distributed with
+ *  * this work for additional information regarding copyright ownership.
+ *  * The ASF licenses this file to You under the Apache License, Version 2.0
+ *  * (the "License"); you may not use this file except in compliance with
+ *  * the License.  You may obtain a copy of the License at
+ *  *
+ *  *http://www.apache.org/licenses/LICENSE-2.0
+ *  *
+ *  * Unless required by applicable law or agreed to in writing, software
+ *  * distributed under the License is distributed on an "AS IS" BASIS,
+ *  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ *  * See the License for the specific language governing permissions and
+ *  * limitations under the License.
+ *
+ */
+
+package org.apache.hudi.common.table.log;
+
+import java.util.List;
+
+public interface KeySpec {

Review Comment:
   javadoc for the public interface



##
hudi-common/src/main/java/org/apache/hudi/common/table/log/FullKeySpec.java:
##
@@ -0,0 +1,40 @@
+/*
+ *
+ *  * Licensed to the Apache Software Foundation (ASF) under one or more
+ *  * contributor license agreements.  See the NOTICE file distributed with
+ *  * this work for additional information regarding copyright ownership.
+ *  * The ASF licenses this file to You under the Apache License, Version 2.0
+ *  * (the "License"); you may not use this file except in compliance with
+ *  * the License.  You may obtain a copy of the License at
+ *  *
+ *  *http://www.apache.org/licenses/LICENSE-2.0
+ *  *
+ *  * Unless required by applicable law or agreed to in writing, software
+ *  * distributed under the License is distributed on an "AS IS" BASIS,
+ *  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ *  * See the License for the specific language governing permissions and
+ *  * limitations under the License.
+ *
+ */

Review Comment:
   Please fix this license comment block (double asterisk) here and other new 
classes



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/FileGroupReaderBasedParquetFileFormat.scala:
##
@@ -0,0 +1,258 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet
+
+import kotlin.NotImplementedError
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+import org.apache.hudi.MergeOnReadSnapshotRelation.createPartitionedFile
+import org.apache.hudi.common.config.TypedProperties
+import org.apache.hudi.common.engine.HoodieReaderContext
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.model.{FileSlice, HoodieBaseFile, HoodieLogFile, HoodieRecord}
+import org.apache.hudi.common.table.read.HoodieFileGroupReader
+import org.apache.hudi.common.util.{Option => HOption}
+import org.apache.hudi.{
+  HoodieBaseRelation, HoodieSparkUtils, HoodieTableSchema, HoodieTableState, MergeOnReadSnapshotRelation,
+  PartitionFileSliceMapping, SparkAdapterSupport, SparkFileFormatInternalRowReaderContext}
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.execution.datasources.PartitionedFile
+import org.apache.spark.sql.hudi.HoodieSqlCommonUtils.isMetaField
+import org.apache.spark.sql.sources.Filter
+import org.apache.spark.sql.types.{StructField, StructType}
+import org.apache.spark.util.SerializableConfiguration
+
+import java.util
+import scala.collection.JavaConverters._
+import scala.collection.mutable
+import scala.jdk.CollectionConverters.asScalaIteratorConverter
+
+class FileGroupReaderBasedParquetFileFormat(tableState: Broadcast[HoodieTableState],
+tableSchema: Broadcast[HoodieTableSchema],

Review Comment:
   Do we really 

[PR] [HUDI-6937]CopyOnWriteInsertHandler#consume cause clustering performance degradation [hudi]

2023-10-12 Thread via GitHub


ksmou opened a new pull request, #9851:
URL: https://github.com/apache/hudi/pull/9851

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6937) CopyOnWriteInsertHandler#consume will cause clustering performance degradation

2023-10-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-6937:
-
Labels: pull-request-available  (was: )

> CopyOnWriteInsertHandler#consume will cause clustering performance degradation
> --
>
> Key: HUDI-6937
> URL: https://issues.apache.org/jira/browse/HUDI-6937
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark
>Reporter: kwang
>Priority: Major
>  Labels: pull-request-available
> Attachments: hudi-0.12-flamegraph.png, hudi-0.12-log.png, 
> hudi-0.14-flamegraph.png, hudi-0.14-log.png
>
>
> We upgraded Hudi from 0.12 to 0.14 and found that offline clustering 
> performance dropped by half. Comparing the flame graphs of the two versions, 
> we found that TypedProperties object instantiation takes too much time, 
> which causes the clustering performance degradation.
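
A rough illustration of the kind of fix the flame graphs point to, hoisting
per-record object construction out of the consume path (the Record/consume
names below are stand-ins, not the actual CopyOnWriteInsertHandler code):

import java.util.Arrays;
import java.util.Iterator;
import java.util.Properties;

public class HoistedPropsSketch {
  // Stand-in for the per-record payload the handler consumes.
  static final class Record {
    final String key;
    Record(String key) {
      this.key = key;
    }
  }

  private final Properties reusableProps = new Properties();

  HoistedPropsSketch(Properties base) {
    // Built once per handler instead of once per consumed record,
    // which is where the 0.14 profile spends its extra time.
    reusableProps.putAll(base);
  }

  void consumeAll(Iterator<Record> records) {
    while (records.hasNext()) {
      consume(records.next()); // hot path stays allocation-free
    }
  }

  private void consume(Record record) {
    reusableProps.getProperty(record.key); // placeholder for the write-handle work
  }

  public static void main(String[] args) {
    Properties base = new Properties();
    base.setProperty("k1", "v1");
    HoistedPropsSketch handler = new HoistedPropsSketch(base);
    handler.consumeAll(Arrays.asList(new Record("k1"), new Record("k2")).iterator());
  }
}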



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-6872] Test out of box schema evolution for deltastreamer [hudi]

2023-10-12 Thread via GitHub


hudi-bot commented on PR #9743:
URL: https://github.com/apache/hudi/pull/9743#issuecomment-1759969653

   
   ## CI report:
   
   * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN
   * 908ae8d3f54adcd9ba24205c9d66add546079c58 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20273)
 
   * f544ec59661637f87d06a6c88eb5130493608e70 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6480] Flink support non-blocking concurrency control [hudi]

2023-10-12 Thread via GitHub


hudi-bot commented on PR #9850:
URL: https://github.com/apache/hudi/pull/9850#issuecomment-1759954394

   
   ## CI report:
   
   * 72aebcc59f5ebebc64402dc8d1d9a491474b1dd0 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20305)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6924] Fix hoodie table config not wok in table properties [hudi]

2023-10-12 Thread via GitHub


hudi-bot commented on PR #9836:
URL: https://github.com/apache/hudi/pull/9836#issuecomment-1759954273

   
   ## CI report:
   
   * c19e0177a8ab98a26370cc238c20b72bfb4abd7c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20307)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] too many s3 list when hoodie.metadata.enable=true [hudi]

2023-10-12 Thread via GitHub


njalan commented on issue #9751:
URL: https://github.com/apache/hudi/issues/9751#issuecomment-1759783515

   @ad1happy2go Below are the list counts for one Spark Streaming micro-batch.
   These are the top list operations (**the first number on each line is the list 
count**) for a table with Hudi 0.13.1 and metadata enabled:
   329 (hive/warehouse/ods_xxx.db/testing_hudi13/.hoodie/metadata/.hoodie/),
   229 (hive/warehouse/ods_xxx.db/testing_hudi13/.hoodie/),
   50 (hive/warehouse/ods_xxx.db/testing_hudi13/.hoodie/metadata/files/),
   42 
(hive/warehouse/ods_xxx.db/testing_hudi13/.hoodie/.aux/.bootstrap/.partitions/-----0_1-0-1_01.hfile/),
   33 (hive/warehouse/ods_xxx.db/testing_hudi13/),
   14 
(hive/warehouse/ods_xxx.db/testing_hudi13/.hoodie/metadata/.hoodie/.temp/),
   10 
(hive/warehouse/ods_xxx.db/testing_hudi13/.hoodie/.temp/20231010140342361/),
9 
(hive/warehouse/ods_xxx.db/testing_hudi13/.hoodie/.temp/20231010140158325/),
7 
(hive/warehouse/ods_xxx.db/testing_hudi13/.hoodie/metadata/.hoodie/.temp/20231010140509929/),
7 
(hive/warehouse/ods_xxx.db/testing_hudi13/.hoodie/metadata/.hoodie/.temp/20231010140342361/),
   
   And these are the top list operations (**the first number on each line is the 
list count**) for a table with Hudi 0.9 and metadata disabled:
   274 (hive/warehouse/ods_.db/testing_hudi09/.hoodie/),
   188 
(hive/warehouse/ods_.db/testing_hudi09/.hoodie/.aux/.bootstrap/.partitions/-----0_1-0-1_01.hfile/),
48 (hive/warehouse/ods_.db/testing_hudi09/),
 9 
(hive/warehouse/ods_.db/testing_hudi09/.hoodie/.temp/20231010140501/),
 9 
(hive/warehouse/ods_.db/testing_hudi09/.hoodie/.temp/20231010140401/),
 9 
(hive/warehouse/ods_.db/testing_hudi09/.hoodie/.temp/20231010140301/),
 9 
(hive/warehouse/ods_.db/testing_hudi09/.hoodie/.temp/20231010140201/),
 9 
(hive/warehouse/ods_.db/testing_hudi09/.hoodie/.temp/20231010140101/),
 5 (hive/warehouse/ods_.db/testing_hudi09/.hoodie/.temp/),
 5 (hive/warehouse/ods_.db/testing_hudi09/.hoodie/.heartbeat/),
   
   
   Is there any way to reduce the list operations? If each table could cut its 
list operations by 50%, the workload would drop significantly, since there are 
thousands of tables on a locally deployed object storage cluster.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


