[GitHub] [hudi] KnightChess closed pull request #9158: [MINOR] Unpersist only relevant metadata table RDDs

2023-07-09 Thread via GitHub


KnightChess closed pull request #9158: [MINOR] Unpersist only relevant metadata table RDDs
URL: https://github.com/apache/hudi/pull/9158


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] KnightChess opened a new pull request, #9158: [MINOR] Unpersist only relevant metadata table RDDs

2023-07-09 Thread via GitHub


KnightChess opened a new pull request, #9158:
URL: https://github.com/apache/hudi/pull/9158

   ### Change Logs
   
   #7914 only removes the base table's cached RDDs; if the metadata table (MDT) is enabled, the MDT's cached RDDs need to be unpersisted too.
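   A minimal sketch of the idea (not the actual patch; it assumes cached RDDs are named with the table base path and that the metadata table lives under `<basePath>/.hoodie/metadata`):

   ```scala
   import org.apache.spark.SparkContext

   // Unpersist only the cached RDDs whose name references the given table path.
   def unpersistRddsForTable(sc: SparkContext, path: String): Unit = {
     sc.getPersistentRDDs.values
       .filter(rdd => Option(rdd.name).exists(_.contains(path)))
       .foreach(_.unpersist())
   }

   // Base table first, then its metadata table (MDT) when it is enabled.
   // unpersistRddsForTable(sc, basePath)
   // unpersistRddsForTable(sc, basePath + "/.hoodie/metadata")
   ```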
   
   ### Impact
   
   None
   
   ### Risk level (write none, low medium or high below)
   
   None
   
   ### Documentation Update
   
   None
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[GitHub] [hudi] hudi-bot commented on pull request #9157: [HUDI-6246] Adding tests to validate restore of compaction commit

2023-07-09 Thread via GitHub


hudi-bot commented on PR #9157:
URL: https://github.com/apache/hudi/pull/9157#issuecomment-1628279986

   
   ## CI report:
   
   * d3a74b072d6c1d55f5412946aa4ba75a8c54076a Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18434)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #8837: [HUDI-6153] Changed the rollback mechanism for MDT to actual rollbacks rather than appending revert blocks.

2023-07-09 Thread via GitHub


hudi-bot commented on PR #8837:
URL: https://github.com/apache/hudi/pull/8837#issuecomment-1628105901

   
   ## CI report:
   
   * debb29aa63e665eed1137973fbe94eee395dc286 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18374)
 
   * e9fb2f74b5333e6f8f60de11ea6adce497252648 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18436)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] danny0405 commented on a diff in pull request #8054: [HUDI-5854]Support flink table column comment

2023-07-09 Thread via GitHub


danny0405 commented on code in PR #8054:
URL: https://github.com/apache/hudi/pull/8054#discussion_r1257693948


##
hudi-sync/hudi-sync-common/src/main/java/org/apache/hudi/sync/common/util/SparkDataSourceTableUtils.java:
##
@@ -34,13 +35,21 @@
 import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.BINARY;
 
 public class SparkDataSourceTableUtils {
+
+  public static Map getSparkTableProperties(List 
partitionNames, String sparkVersion,
+int 
schemaLengthThreshold, MessageType schema) {
+return getSparkTableProperties(partitionNames, sparkVersion, 
schemaLengthThreshold, schema, Collections.emptyMap());
+  }
+
   /**
* Get Spark Sql related table properties. This is used for spark datasource 
table.
-   * @param schema  The schema to write to the table.
+   *
+   * @param schema The schema to write to the table.
* @return A new parameters added the spark's table properties.
*/
   public static Map getSparkTableProperties(List 
partitionNames, String sparkVersion,
-int 
schemaLengthThreshold, MessageType schema) {
+int 
schemaLengthThreshold, MessageType schema,
+Map commentMap) {

Review Comment:
   Where is the new method used?






[GitHub] [hudi] danny0405 commented on a diff in pull request #8054: [HUDI-5854]Support flink table column comment

2023-07-09 Thread via GitHub


danny0405 commented on code in PR #8054:
URL: https://github.com/apache/hudi/pull/8054#discussion_r1257693688


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/catalog/HiveSchemaUtils.java:
##
@@ -176,45 +177,6 @@ private static DataType 
toFlinkPrimitiveType(PrimitiveTypeInfo hiveType) {
 }
   }
 
-  /**
-   * Create Hive field schemas from Flink table schema including the hoodie 
metadata fields.
-   */
-  public static List toHiveFieldSchema(TableSchema schema, 
boolean withOperationField) {
-List columns = new ArrayList<>();

Review Comment:
   Can we modify these methods directly instead of removing them and adding new ones?






[GitHub] [hudi] hudi-bot commented on pull request #8837: [HUDI-6153] Changed the rollback mechanism for MDT to actual rollbacks rather than appending revert blocks.

2023-07-09 Thread via GitHub


hudi-bot commented on PR #8837:
URL: https://github.com/apache/hudi/pull/8837#issuecomment-1628092201

   
   ## CI report:
   
   * debb29aa63e665eed1137973fbe94eee395dc286 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18374)
 
   * e9fb2f74b5333e6f8f60de11ea6adce497252648 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] danny0405 commented on a diff in pull request #9133: [HUDI-6474] Added support for reading tables evolved using comprehensive schema e…

2023-07-09 Thread via GitHub


danny0405 commented on code in PR #9133:
URL: https://github.com/apache/hudi/pull/9133#discussion_r1257689584


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/CastMap.java:
##
@@ -165,21 +192,132 @@ void add(int pos, LogicalType fromType, LogicalType 
toType) {
 }
 break;
   }
+  case ARRAY: {
+if (from == ARRAY) {
+  LogicalType fromElementType =  fromType.getChildren().get(0);
+  LogicalType toElementType = toType.getChildren().get(0);
+  return array -> doArrayConversion((ArrayData) array, 
fromElementType, toElementType);
+}
+break;
+  }
+  case MAP: {
+if (from == MAP) {
+  return map -> doMapConversion((MapData) map, fromType, toType);
+}
+break;
+  }
+  case ROW: {
+if (from == ROW) {
+  // Assumption: InternalSchemaManager should produce a cast that is 
of the same size
+  return row -> doRowConversion((RowData) row, fromType, toType);
+}
+break;
+  }
   default:
 }
-return null;
+throw new IllegalArgumentException(String.format("Unsupported conversion 
for %s => %s", fromType, toType));
   }
 
-  private void add(int pos, Cast cast) {
-castMap.put(pos, cast);
+  /**
+   * Helper function to perform convert an arrayData from one LogicalType to 
another.
+   *
+   * @param arrayNon-null array data to be converted; however 
array-elements are allowed to be null
+   * @param fromType The input LogicalType of the row data to be converted from
+   * @param toType   The output LogicalType of the row data to be converted to
+   * @return Converted array that has the structure/specifications of that 
defined by the output LogicalType
+   */
+  private static ArrayData doArrayConversion(@Nonnull ArrayData array, 
LogicalType fromType, LogicalType toType) {
+// using Object type here as primitives are not allowed to be null
+Object[] objects = new Object[array.size()];
+for (int i = 0; i < array.size(); i++) {
+  Object fromObject = 
ArrayData.createElementGetter(fromType).getElementOrNull(array, i);
+  // need to handle nulls to prevent NullPointerException in 
#getConversion()
+  Object toObject = fromObject != null ? getConversion(fromType, 
toType).apply(fromObject) : null;
+  objects[i] = toObject;
+}
+return new GenericArrayData(objects);
+  }
+
+  /**
+   * Helper function to perform convert a MapData from one LogicalType to 
another.
+   *
+   * @param map  Non-null map data to be converted; however, values are 
allowed to be null
+   * @param fromType The input LogicalType of the row data to be converted from
+   * @param toType   The output LogicalType of the row data to be converted to
+   * @return Converted map that has the structure/specifications of that 
defined by the output LogicalType
+   */
+  private static MapData doMapConversion(@Nonnull MapData map, LogicalType 
fromType, LogicalType toType) {
+// no schema evolution is allowed on the keyType, hence, we only need to 
care about the valueType
+LogicalType fromValueType = fromType.getChildren().get(1);
+LogicalType toValueType = toType.getChildren().get(1);
+LogicalType keyType = fromType.getChildren().get(0);
+
+final Map result = new HashMap<>();
+for (int i = 0; i < map.size(); i++) {
+  Object keyObject = 
ArrayData.createElementGetter(keyType).getElementOrNull(map.keyArray(), i);
+  Object fromObject = 
ArrayData.createElementGetter(fromValueType).getElementOrNull(map.valueArray(), 
i);
+  // need to handle nulls to prevent NullPointerException in 
#getConversion()
+  Object toObject = fromObject != null ? getConversion(fromValueType, 
toValueType).apply(fromObject) : null;
+  result.put(keyObject, toObject);
+}
+return new GenericMapData(result);
+  }
+
+  /**
+   * Helper function to perform convert a RowData from one LogicalType to 
another.
+   *
+   * @param row  Non-null row data to be converted; however, fields might 
contain nulls
+   * @param fromType The input LogicalType of the row data to be converted from
+   * @param toType   The output LogicalType of the row data to be converted to
+   * @return Converted row that has the structure/specifications of that 
defined by the output LogicalType
+   */
+  private static RowData doRowConversion(@Nonnull RowData row, LogicalType 
fromType, LogicalType toType) {
+// note: InternalSchema.merge guarantees that the schema to be read 
fromType is orientated in the same order as toType
+// hence, we can match types by position as it is guaranteed that it is 
referencing the same field
+List fromChildren = fromType.getChildren();
+List toChildren = toType.getChildren();
+ValidationUtils.checkArgument(fromChildren.size() == toChildren.size(),
+"fromType [" + fromType + "] size: != toType [" + toType + "] size");
+
+

[GitHub] [hudi] danny0405 commented on a diff in pull request #9133: [HUDI-6474] Added support for reading tables evolved using comprehensive schema e…

2023-07-09 Thread via GitHub


danny0405 commented on code in PR #9133:
URL: https://github.com/apache/hudi/pull/9133#discussion_r1255288699


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/CastMap.java:
##
@@ -165,21 +192,132 @@ void add(int pos, LogicalType fromType, LogicalType 
toType) {
 }
 break;
   }
+  case ARRAY: {
+if (from == ARRAY) {
+  LogicalType fromElementType =  fromType.getChildren().get(0);
+  LogicalType toElementType = toType.getChildren().get(0);
+  return array -> doArrayConversion((ArrayData) array, 
fromElementType, toElementType);
+}
+break;
+  }
+  case MAP: {
+if (from == MAP) {
+  return map -> doMapConversion((MapData) map, fromType, toType);
+}
+break;
+  }
+  case ROW: {
+if (from == ROW) {
+  // Assumption: InternalSchemaManager should produce a cast that is 
of the same size
+  return row -> doRowConversion((RowData) row, fromType, toType);
+}
+break;
+  }
   default:
 }
-return null;
+throw new IllegalArgumentException(String.format("Unsupported conversion 
for %s => %s", fromType, toType));

Review Comment:
   Do not throw `RuntimeException` in a nested calling code path; it is very hard for the invoker to be aware of such exceptions. Either throw a checked exception or return null as before.






[GitHub] [hudi] danny0405 commented on pull request #9118: [HUDI-2141] Support flink write metrics

2023-07-09 Thread via GitHub


danny0405 commented on PR #9118:
URL: https://github.com/apache/hudi/pull/9118#issuecomment-1628084953

   Thanks for the contribution @stream2000. Can we add a table listing the metrics we support here, illustrating the usage of each metric, so that it is clear whether each one deserves to be introduced?
   
   Another metric I got from feedback is a compaction progress metric. Do we have a plan to support that?





[GitHub] [hudi] danny0405 commented on pull request #9156: [HUDI-6515] Fix bucket index row writer write record to wrong handle

2023-07-09 Thread via GitHub


danny0405 commented on PR #9156:
URL: https://github.com/apache/hudi/pull/9156#issuecomment-1628072273

   > because we can make sure that the partition records is sorted.
   
   In the near future, non-sorted partition code may be introduced, so I think we can keep it for extensibility for now.





[GitHub] [hudi] danny0405 commented on a diff in pull request #9126: [HUDI-6482] Supports new compaction strategy DayBasedAndBoundedIOCompactionStrategy

2023-07-09 Thread via GitHub


danny0405 commented on code in PR #9126:
URL: https://github.com/apache/hudi/pull/9126#discussion_r1257684369


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/strategy/DayBasedAndBoundedIOCompactionStrategy.java:
##
@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table.action.compact.strategy;
+
+import org.apache.hudi.avro.model.HoodieCompactionOperation;
+import org.apache.hudi.avro.model.HoodieCompactionPlan;
+import org.apache.hudi.config.HoodieWriteConfig;
+
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+public class DayBasedAndBoundedIOCompactionStrategy extends 
DayBasedCompactionStrategy {
+
+  protected static Comparator logSizeComparator = 
(op1, op2) -> {
+Long logSize1 = op1.getMetrics().get(TOTAL_LOG_FILE_SIZE).longValue();

Review Comment:
   private static final Comparator 
LOG_SIZE_COMPARATOR






[GitHub] [hudi] danny0405 commented on issue #9143: [SUPPORT] Failure to delete records with missing attributes from PostgresDebeziumSource

2023-07-09 Thread via GitHub


danny0405 commented on issue #9143:
URL: https://github.com/apache/hudi/issues/9143#issuecomment-1628055348

   cc @ad1happy2go seems this issue has been fixed?





[GitHub] [hudi] stream2000 commented on pull request #9156: [HUDI-6515] Fix bucket index row writer write record to wrong handle

2023-07-09 Thread via GitHub


stream2000 commented on PR #9156:
URL: https://github.com/apache/hudi/pull/9156#issuecomment-1628043527

   > 
[6515.patch.zip](https://github.com/apache/hudi/files/11997469/6515.patch.zip) 
Nice catch, does this fix makes sense to you?
   
   Yes, this fix is OK and involves fewer modifications. I made more changes because we can make sure that the partition records are sorted. I'm OK with both approaches; what do you think?





[GitHub] [hudi] danny0405 commented on pull request #9156: [HUDI-6515] Fix bucket index row writer write record to wrong handle

2023-07-09 Thread via GitHub


danny0405 commented on PR #9156:
URL: https://github.com/apache/hudi/pull/9156#issuecomment-1628023795

   
[6515.patch.zip](https://github.com/apache/hudi/files/11997469/6515.patch.zip)
   Nice catch. Does this fix make sense to you?





[hudi] branch master updated: [HUDI-6316] Adding corrupted and rollback log blocks metrics (#8881)

2023-07-09 Thread codope
This is an automated email from the ASF dual-hosted git repository.

codope pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new a0ca3eadba5 [HUDI-6316] Adding corrupted and rollback log blocks 
metrics (#8881)
a0ca3eadba5 is described below

commit a0ca3eadba5a041aaaf97d040adffc017cfdd285
Author: Sivabalan Narayanan 
AuthorDate: Sun Jul 9 23:10:43 2023 -0400

[HUDI-6316] Adding corrupted and rollback log blocks metrics (#8881)

Adding log block metrics to track corrupted log blocks and rollback blocks.
Users need to enable `hoodie.metrics.compaction.log.blocks.on`
to enable the metrics.
---
 .../org/apache/hudi/config/HoodieWriteConfig.java  |  7 ++
 .../hudi/config/metrics/HoodieMetricsConfig.java   | 11 +++
 .../org/apache/hudi/metrics/HoodieMetrics.java | 79 ++
 .../org/apache/hudi/metrics/TestHoodieMetrics.java | 41 ++-
 .../hudi/common/model/HoodieCommitMetadata.java| 20 ++
 5 files changed, 113 insertions(+), 45 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
index 93105491180..1390408d901 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
@@ -2088,6 +2088,13 @@ public class HoodieWriteConfig extends HoodieConfig {
 return getBoolean(HoodieMetricsConfig.TURN_METRICS_ON);
   }
 
+  /**
+   * metrics properties.
+   */
+  public boolean isCompactionLogBlockMetricsOn() {
+return 
getBoolean(HoodieMetricsConfig.TURN_METRICS_COMPACTION_LOG_BLOCKS_ON);
+  }
+
   public boolean isExecutorMetricsEnabled() {
 return Boolean.parseBoolean(
 getStringOrDefault(HoodieMetricsConfig.EXECUTOR_METRICS_ENABLE, 
"false"));
diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/metrics/HoodieMetricsConfig.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/metrics/HoodieMetricsConfig.java
index 9fe9b33a546..e1d0afeb6fa 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/metrics/HoodieMetricsConfig.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/metrics/HoodieMetricsConfig.java
@@ -106,6 +106,12 @@ public class HoodieMetricsConfig extends HoodieConfig {
   .sinceVersion("0.14.0")
   .withDocumentation("Comma separated list of config file paths for metric 
exporter configs");
 
+  public static final ConfigProperty 
TURN_METRICS_COMPACTION_LOG_BLOCKS_ON = ConfigProperty
+  .key(METRIC_PREFIX + "compaction.log.blocks.on")
+  .defaultValue(false)
+  .sinceVersion("0.14.0")
+  .withDocumentation("Turn on/off metrics reporting for log blocks with 
compaction commit. off by default.");
+
   /**
* @deprecated Use {@link #TURN_METRICS_ON} and its methods instead
*/
@@ -171,6 +177,11 @@ public class HoodieMetricsConfig extends HoodieConfig {
   return this;
 }
 
+public Builder compactionLogBlocksEnable(boolean 
compactionLogBlockMetricsEnable) {
+  hoodieMetricsConfig.setValue(TURN_METRICS_COMPACTION_LOG_BLOCKS_ON, 
String.valueOf(compactionLogBlockMetricsEnable));
+  return this;
+}
+
 public Builder withReporterType(String reporterType) {
   hoodieMetricsConfig.setValue(METRICS_REPORTER_TYPE_VALUE, reporterType);
   return this;
diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/HoodieMetrics.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/HoodieMetrics.java
index dac680a5c40..792d0cd0844 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/HoodieMetrics.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/HoodieMetrics.java
@@ -37,6 +37,23 @@ public class HoodieMetrics {
 
   private static final Logger LOG = 
LoggerFactory.getLogger(HoodieMetrics.class);
 
+  public static final String TOTAL_PARTITIONS_WRITTEN_STR = 
"totalPartitionsWritten";
+  public static final String TOTAL_FILES_INSERT_STR = "totalFilesInsert";
+  public static final String TOTAL_FILES_UPDATE_STR = "totalFilesUpdate";
+  public static final String TOTAL_RECORDS_WRITTEN_STR = "totalRecordsWritten";
+  public static final String TOTAL_UPDATE_RECORDS_WRITTEN_STR = 
"totalUpdateRecordsWritten";
+  public static final String TOTAL_INSERT_RECORDS_WRITTEN_STR = 
"totalInsertRecordsWritten";
+  public static final String TOTAL_BYTES_WRITTEN_STR = "totalBytesWritten";
+  public static final String TOTAL_SCAN_TIME_STR = "totalScanTime";
+  public static final String TOTAL_CREATE_TIME_STR = "totalCreateTime";
+  public static final String 
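For reference, a hedged sketch of turning the new metric on through a writer config (the config key and builder method come from the diff above; the table path is a placeholder, and the quoted diff itself is truncated by the archive):

```scala
import java.util.Properties
import org.apache.hudi.config.HoodieWriteConfig

// Hedged sketch: enable overall metrics plus the new compaction log-block metrics.
val props = new Properties()
props.setProperty("hoodie.metrics.on", "true")
props.setProperty("hoodie.metrics.compaction.log.blocks.on", "true")

val writeConfig = HoodieWriteConfig.newBuilder()
  .withPath("/tmp/hudi_table")   // placeholder base path
  .withProperties(props)
  .build()
```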

[GitHub] [hudi] codope merged pull request #8881: [HUDI-6316] Adding corrupted and rollback log blocks metrics

2023-07-09 Thread via GitHub


codope merged PR #8881:
URL: https://github.com/apache/hudi/pull/8881





[GitHub] [hudi] hudi-bot commented on pull request #9157: [HUDI-6246] Adding tests to validate restore of compaction commit

2023-07-09 Thread via GitHub


hudi-bot commented on PR #9157:
URL: https://github.com/apache/hudi/pull/9157#issuecomment-1628014896

   
   ## CI report:
   
   * d3a74b072d6c1d55f5412946aa4ba75a8c54076a Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18434)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] danny0405 commented on issue #9138: [SUPPORT]Read Hive then write Hudi,but org.apache.hadoop.hive.ql.metadata.AuthorizationException: No privilege 'Update' found for outputs

2023-07-09 Thread via GitHub


danny0405 commented on issue #9138:
URL: https://github.com/apache/hudi/issues/9138#issuecomment-1628004963

   Seems to be a privilege issue. Did you try to grant the permission with a command like:
   `grant xxx_action on database xxx_database to user xxx_user;`





[GitHub] [hudi] hudi-bot commented on pull request #9157: [HUDI-6246] Adding tests to validate restore of compaction commit

2023-07-09 Thread via GitHub


hudi-bot commented on PR #9157:
URL: https://github.com/apache/hudi/pull/9157#issuecomment-1628004052

   
   ## CI report:
   
   * d3a74b072d6c1d55f5412946aa4ba75a8c54076a UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #8697: [HUDI-5514] Improving usability/performance with out of box default for append only use-cases

2023-07-09 Thread via GitHub


hudi-bot commented on PR #8697:
URL: https://github.com/apache/hudi/pull/8697#issuecomment-1628003363

   
   ## CI report:
   
   * 1fc455198964e104536b807792dde341c960de88 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18433)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] danny0405 commented on issue #9154: [SUPPORT] org.apache.hudi.exception.HoodieIOException: Unable to create

2023-07-09 Thread via GitHub


danny0405 commented on issue #9154:
URL: https://github.com/apache/hudi/issues/9154#issuecomment-1628001354

   Did you configure the spillable map path explicitly? Wondering if it is related to disk space or something similar.
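   For reference, a hedged sketch of setting the spillable map path explicitly on a Spark write (the directory is a placeholder; `df` and `basePath` are assumed to exist):

```scala
// Hedged sketch: point Hudi's spillable map at a directory with enough free space.
df.write.format("hudi")
  .option("hoodie.table.name", "my_table")
  .option("hoodie.memory.spillable.map.path", "/data/tmp/hudi_spill") // placeholder dir
  .mode("append")
  .save(basePath)
```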





[GitHub] [hudi] hudi-bot commented on pull request #8881: [HUDI-6316] Adding corrupted and rollback log blocks metrics

2023-07-09 Thread via GitHub


hudi-bot commented on PR #8881:
URL: https://github.com/apache/hudi/pull/8881#issuecomment-1627996896

   
   ## CI report:
   
   * aa126962099394261801fe5fbf679bd192033f80 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18431)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] danny0405 commented on a diff in pull request #9123: [HUDI-6478] Simplifying INSERT_INTO configs for spark-sql

2023-07-09 Thread via GitHub


danny0405 commented on code in PR #9123:
URL: https://github.com/apache/hudi/pull/9123#discussion_r1257660429


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala:
##
@@ -1094,6 +1094,11 @@ object HoodieSparkSqlWriter {
 if (mergedParams.contains(PRECOMBINE_FIELD.key())) {
   mergedParams.put(HoodiePayloadProps.PAYLOAD_ORDERING_FIELD_PROP_KEY, 
mergedParams(PRECOMBINE_FIELD.key()))
 }
+if (mergedParams.get(OPERATION.key()).get == INSERT_OPERATION_OPT_VAL && 
mergedParams.contains(DataSourceWriteOptions.INSERT_DUP_POLICY.key())
+  && mergedParams.get(DataSourceWriteOptions.INSERT_DUP_POLICY.key()).get 
!= FAIL_INSERT_DUP_POLICY) {
+  // enable merge allow duplicates when operation type is insert
+  
mergedParams.put(HoodieWriteConfig.MERGE_ALLOW_DUPLICATE_ON_INSERTS_ENABLE.key(),
 "true")

Review Comment:
   Yeah, kind of obscure at first sight; why not just set the default value of `hoodie.merge.allow.duplicate.on.inserts` to true?






[GitHub] [hudi] danny0405 commented on a diff in pull request #9153: [HUDI-6512] Check whether exists write error before commit compaction

2023-07-09 Thread via GitHub


danny0405 commented on code in PR #9153:
URL: https://github.com/apache/hudi/pull/9153#discussion_r1257656038


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/SparkRDDTableServiceClient.java:
##
@@ -291,6 +287,13 @@ private void 
completeClustering(HoodieReplaceCommitMetadata metadata,
 LOG.info("Clustering successfully on commit " + clusteringCommitTime);
   }
 
+  private void handleWriteErrors(List writeStats, 
TableServiceType tableServiceType) {
+if 
(writeStats.stream().mapToLong(HoodieWriteStat::getTotalWriteErrors).sum() > 0) 
{
+  throw new HoodieClusteringException(tableServiceType + " failed to write 
to files:"
+  + writeStats.stream().filter(s -> s.getTotalWriteErrors() > 
0L).map(HoodieWriteStat::getFileId).collect(Collectors.joining(",")));

Review Comment:
   Wondering whether we need a config option to control this behavior? If all the exceptions are resolvable, it's okay to throw exceptions directly.






[GitHub] [hudi] nsivabalan opened a new pull request, #9157: [HUDI-6246] Adding tests to validate restore of compaction commit

2023-07-09 Thread via GitHub


nsivabalan opened a new pull request, #9157:
URL: https://github.com/apache/hudi/pull/9157

   ### Change Logs
   
   Adding tests to validate restore of compaction commit
   
   ### Impact
   
   Adding tests to validate restore of compaction commit
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   N/A
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[GitHub] [hudi] hudi-bot commented on pull request #9123: [HUDI-6478] Simplifying INSERT_INTO configs for spark-sql

2023-07-09 Thread via GitHub


hudi-bot commented on PR #9123:
URL: https://github.com/apache/hudi/pull/9123#issuecomment-1627893231

   
   ## CI report:
   
   * beb523cca98b4b62964c485e464d5eb1ce6e25a2 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18432)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #8697: [HUDI-5514] Improving usability/performance with out of box default for append only use-cases

2023-07-09 Thread via GitHub


hudi-bot commented on PR #8697:
URL: https://github.com/apache/hudi/pull/8697#issuecomment-1627870285

   
   ## CI report:
   
   * 6250cd0f2bfe2ba9c3b3053940f2a75be78c2f98 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17041)
 
   * 1fc455198964e104536b807792dde341c960de88 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18433)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #8697: [HUDI-5514] Improving usability/performance with out of box default for append only use-cases

2023-07-09 Thread via GitHub


hudi-bot commented on PR #8697:
URL: https://github.com/apache/hudi/pull/8697#issuecomment-1627867274

   
   ## CI report:
   
   * 6250cd0f2bfe2ba9c3b3053940f2a75be78c2f98 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17041)
 
   * 1fc455198964e104536b807792dde341c960de88 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] nsivabalan commented on a diff in pull request #8697: [HUDI-5514] Improving usability/performance with out of box default for append only use-cases

2023-07-09 Thread via GitHub


nsivabalan commented on code in PR #8697:
URL: https://github.com/apache/hudi/pull/8697#discussion_r1257563336


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/ProvidesHoodieConfig.scala:
##
@@ -142,24 +144,27 @@ trait ProvidesHoodieConfig extends Logging {
 //   we'd prefer that value over auto-deduced operation. Otherwise, we 
deduce target operation type
 val operationOverride = 
combinedOpts.get(DataSourceWriteOptions.OPERATION.key)
 val operation = operationOverride.getOrElse {
-  (enableBulkInsert, isOverwritePartition, isOverwriteTable, 
dropDuplicate, isNonStrictMode, isPartitionedTable) match {
-case (true, _, _, _, false, _) =>
+  (enableBulkInsert, isOverwritePartition, isOverwriteTable, 
dropDuplicate, isNonStrictMode, isPartitionedTable,
+  autoGenerateRecordKeys) match {
+case (true, _, _, _, false, _, _) =>
   throw new IllegalArgumentException(s"Table with primaryKey can not 
use bulk insert in ${insertMode.value()} mode.")
-case (true, true, _, _, _, true) =>
+case (true, true, _, _, _, true, _) =>
   throw new IllegalArgumentException(s"Insert Overwrite Partition can 
not use bulk insert.")
-case (true, _, _, true, _, _) =>
+case (true, _, _, true, _, _, _) =>
   throw new IllegalArgumentException(s"Bulk insert cannot support drop 
duplication." +
 s" Please disable $INSERT_DROP_DUPS and try again.")
 // if enableBulkInsert is true, use bulk insert for the insert 
overwrite non-partitioned table.
-case (true, false, true, _, _, false) => BULK_INSERT_OPERATION_OPT_VAL
+case (true, false, true, _, _, false, _) => 
BULK_INSERT_OPERATION_OPT_VAL
 // insert overwrite table
-case (false, false, true, _, _, _) => 
INSERT_OVERWRITE_TABLE_OPERATION_OPT_VAL
+case (false, false, true, _, _, _, _) => 
INSERT_OVERWRITE_TABLE_OPERATION_OPT_VAL
 // insert overwrite partition
-case (_, true, false, _, _, true) => INSERT_OVERWRITE_OPERATION_OPT_VAL
+case (_, true, false, _, _, true, _) => 
INSERT_OVERWRITE_OPERATION_OPT_VAL
 // disable dropDuplicate, and provide preCombineKey, use the upsert 
operation for strict and upsert mode.
-case (false, false, false, false, false, _) if hasPrecombineColumn => 
UPSERT_OPERATION_OPT_VAL
+case (false, false, false, false, false, _, _) if hasPrecombineColumn 
=> UPSERT_OPERATION_OPT_VAL
 // if table is pk table and has enableBulkInsert use bulk insert for 
non-strict mode.
-case (true, _, _, _, true, _) => BULK_INSERT_OPERATION_OPT_VAL
+case (true, _, _, _, true, _, _) => BULK_INSERT_OPERATION_OPT_VAL
+// if auto record key generation is enabled, use bulk_insert
+case (_, _, _, _, _, true, true) => BULK_INSERT_OPERATION_OPT_VAL

Review Comment:
   fixed it. 



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala:
##
@@ -402,6 +389,40 @@ object HoodieSparkSqlWriter {
 }
   }
 
+  def deduceOperation(hoodieConfig: HoodieConfig, paramsWithoutDefaults : 
Map[String, String]): WriteOperationType = {
+var operation = 
WriteOperationType.fromValue(hoodieConfig.getString(OPERATION))
+// TODO clean up

Review Comment:
   nope. I just copied the comments over. not really sure



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala:
##
@@ -402,6 +389,40 @@ object HoodieSparkSqlWriter {
 }
   }
 
+  def deduceOperation(hoodieConfig: HoodieConfig, paramsWithoutDefaults : 
Map[String, String]): WriteOperationType = {
+var operation = 
WriteOperationType.fromValue(hoodieConfig.getString(OPERATION))
+// TODO clean up
+// It does not make sense to allow upsert() operation if INSERT_DROP_DUPS 
is true
+// Auto-correct the operation to "insert" if OPERATION is set to "upsert" 
wrongly
+// or not set (in which case it will be set as "upsert" by 
parametersWithWriteDefaults()) .
+if (hoodieConfig.getBoolean(INSERT_DROP_DUPS) &&
+  operation == WriteOperationType.UPSERT) {
+
+  log.warn(s"$UPSERT_OPERATION_OPT_VAL is not applicable " +
+s"when $INSERT_DROP_DUPS is set to be true, " +
+s"overriding the $OPERATION to be $INSERT_OPERATION_OPT_VAL")
+
+  operation = WriteOperationType.INSERT
+}
+
+// if no record key, no preCombine and no explicit partition path is set, 
we should treat it as append only workload

Review Comment:
   We are being conservative here. The major motivation is to target users coming from a parquet table. They would be writing to the parquet table as df.write.format("parquet").save(basePath).
   When they want to use Hudi, the table name is mandatory, but everything else is optional.
   
   So, all they need to do is, 
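   (The comment is cut off in the archive; the sketch below is only a hedged guess at the kind of minimal write being described, with `df` and `basePath` as placeholders.)

```scala
// Existing parquet workload:
df.write.format("parquet").save(basePath)

// Hedged sketch of the equivalent Hudi write: only the table name is added;
// record key, precombine field, etc. are left to the new defaults.
df.write.format("hudi")
  .option("hoodie.table.name", "my_table")
  .mode("append")
  .save(basePath)
```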
   

[GitHub] [hudi] hudi-bot commented on pull request #9123: [HUDI-6478] Simplifying INSERT_INTO configs for spark-sql

2023-07-09 Thread via GitHub


hudi-bot commented on PR #9123:
URL: https://github.com/apache/hudi/pull/9123#issuecomment-1627855511

   
   ## CI report:
   
   * 7708ff75ba467e2156b6396ee2886ec645b7b44f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18325)
 
   * beb523cca98b4b62964c485e464d5eb1ce6e25a2 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18432)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #8881: [HUDI-6316] Adding corrupted and rollback log blocks metrics

2023-07-09 Thread via GitHub


hudi-bot commented on PR #8881:
URL: https://github.com/apache/hudi/pull/8881#issuecomment-1627855419

   
   ## CI report:
   
   * 7b9d3ca6b52733dc531f54d6bf21ca449cc3254f Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17592)
 
   * aa126962099394261801fe5fbf679bd192033f80 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18431)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9123: [HUDI-6478] Simplifying INSERT_INTO configs for spark-sql

2023-07-09 Thread via GitHub


hudi-bot commented on PR #9123:
URL: https://github.com/apache/hudi/pull/9123#issuecomment-1627853487

   
   ## CI report:
   
   * 7708ff75ba467e2156b6396ee2886ec645b7b44f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18325)
 
   * beb523cca98b4b62964c485e464d5eb1ce6e25a2 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #8881: [HUDI-6316] Adding corrupted and rollback log blocks metrics

2023-07-09 Thread via GitHub


hudi-bot commented on PR #8881:
URL: https://github.com/apache/hudi/pull/8881#issuecomment-162785

   
   ## CI report:
   
   * 7b9d3ca6b52733dc531f54d6bf21ca449cc3254f Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17592)
 
   * aa126962099394261801fe5fbf679bd192033f80 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] nsivabalan commented on pull request #8881: [HUDI-6316] Adding corrupted and rollback log blocks metrics

2023-07-09 Thread via GitHub


nsivabalan commented on PR #8881:
URL: https://github.com/apache/hudi/pull/8881#issuecomment-1627851978

   @codope : done. addressed the feedback





[GitHub] [hudi] nsivabalan commented on pull request #9123: [HUDI-6478] Simplifying INSERT_INTO configs for spark-sql

2023-07-09 Thread via GitHub


nsivabalan commented on PR #9123:
URL: https://github.com/apache/hudi/pull/9123#issuecomment-1627844217

   hey @danny0405 @codope : Updated the patch. rebased w/ latest master. 





[GitHub] [hudi] nsivabalan commented on a diff in pull request #9123: [HUDI-6478] Simplifying INSERT_INTO configs for spark-sql

2023-07-09 Thread via GitHub


nsivabalan commented on code in PR #9123:
URL: https://github.com/apache/hudi/pull/9123#discussion_r1257549916


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/ProvidesHoodieConfig.scala:
##
@@ -187,29 +218,38 @@ trait ProvidesHoodieConfig extends Logging {
 
 val insertModeOpt = combinedOpts.get(SQL_INSERT_MODE.key)
 val insertModeSet = insertModeOpt.nonEmpty
+val sqlWriteOperationOpt = combinedOpts.get(SQL_WRITE_OPERATION.key())
+val sqlWriteOperationSet = sqlWriteOperationOpt.nonEmpty
+val sqlWriteOperation = 
sqlWriteOperationOpt.getOrElse(SQL_WRITE_OPERATION.defaultValue())
+val insertDupPolicy = combinedOpts.getOrElse(INSERT_DUP_POLICY.key(), 
INSERT_DUP_POLICY.defaultValue())
 val insertMode = 
InsertMode.of(insertModeOpt.getOrElse(SQL_INSERT_MODE.defaultValue()))
 val isNonStrictMode = insertMode == InsertMode.NON_STRICT
 val isPartitionedTable = hoodieCatalogTable.partitionFields.nonEmpty
 val combineBeforeInsert = hoodieCatalogTable.preCombineKey.nonEmpty && 
hoodieCatalogTable.primaryKeys.nonEmpty
 
-// NOTE: Target operation could be overridden by the user, therefore if it 
has been provided as an input
-//   we'd prefer that value over auto-deduced operation. Otherwise, we 
deduce target operation type
+// try to use sql write operation instead of legacy insert mode. If only 
insert mode is explicitly specified, we will uze
+// o
+val useLegacyInsertModeFlow = insertModeSet && !sqlWriteOperationSet
 val operation = combinedOpts.getOrElse(OPERATION.key,
+  if (useLegacyInsertModeFlow) {
+// NOTE: Target operation could be overridden by the user, therefore 
if it has been provided as an input
+//   we'd prefer that value over auto-deduced operation. 
Otherwise, we deduce target operation type
   deduceWriteOperationForInsertInfo(isPartitionedTable, 
isOverwritePartition, isOverwriteTable, insertModeSet, dropDuplicate,
-enableBulkInsert, isInsertInto, isNonStrictMode, combineBeforeInsert))
+enableBulkInsert, isInsertInto, isNonStrictMode, combineBeforeInsert)
+  } else {
+deduceSqlWriteOperation(isOverwritePartition, isOverwriteTable, 
sqlWriteOperation)
+  }
+)
 
-val payloadClassName = if (operation == UPSERT_OPERATION_OPT_VAL &&
-  tableType == COW_TABLE_TYPE_OPT_VAL && insertMode == InsertMode.STRICT) {
-  // Validate duplicate key for COW, for MOR it will do the merge with the 
DefaultHoodieRecordPayload
-  // on reading.
-  // TODO use HoodieSparkValidateDuplicateKeyRecordMerger when 
SparkRecordMerger is default
-  classOf[ValidateDuplicateKeyPayload].getCanonicalName
-} else if (operation == INSERT_OPERATION_OPT_VAL && tableType == 
COW_TABLE_TYPE_OPT_VAL &&
-  insertMode == InsertMode.STRICT){
-  // Validate duplicate key for inserts to COW table when using strict 
insert mode.
-  classOf[ValidateDuplicateKeyPayload].getCanonicalName
+val payloadClassName =  if (useLegacyInsertModeFlow) {
+  deducePayloadClassNameLegacy(operation, tableType, insertMode)
 } else {
-  classOf[OverwriteWithLatestAvroPayload].getCanonicalName
+  // should we also consider old way of doing things.

Review Comment:
   This is already taken care of in deducePayloadClassNameLegacy; none of the downstream methods do anything differently. It's only used to deduce the payload class.






[GitHub] [hudi] nsivabalan commented on a diff in pull request #9123: [HUDI-6478] Simplifying INSERT_INTO configs for spark-sql

2023-07-09 Thread via GitHub


nsivabalan commented on code in PR #9123:
URL: https://github.com/apache/hudi/pull/9123#discussion_r1257548654


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala:
##
@@ -1094,6 +1094,11 @@ object HoodieSparkSqlWriter {
 if (mergedParams.contains(PRECOMBINE_FIELD.key())) {
   mergedParams.put(HoodiePayloadProps.PAYLOAD_ORDERING_FIELD_PROP_KEY, 
mergedParams(PRECOMBINE_FIELD.key()))
 }
+if (mergedParams.get(OPERATION.key()).get == INSERT_OPERATION_OPT_VAL && 
mergedParams.contains(DataSourceWriteOptions.INSERT_DUP_POLICY.key())
+  && mergedParams.get(DataSourceWriteOptions.INSERT_DUP_POLICY.key()).get 
!= FAIL_INSERT_DUP_POLICY) {
+  // enable merge allow duplicates when operation type is insert
+  
mergedParams.put(HoodieWriteConfig.MERGE_ALLOW_DUPLICATE_ON_INSERTS_ENABLE.key(),
 "true")

Review Comment:
   This is not de-dup; this is actually achieving what you are claiming, Danny.
   i.e.
   if you ingest (RK1, val1) in commit1 and (RK1, val2) in commit2 with the insert operation type, the snapshot will return both values only when you set "MERGE_ALLOW_DUPLICATE_ON_INSERTS_ENABLE.key" = true.
   






[GitHub] [hudi] nsivabalan commented on a diff in pull request #9123: [HUDI-6478] Simplifying INSERT_INTO configs for spark-sql

2023-07-09 Thread via GitHub


nsivabalan commented on code in PR #9123:
URL: https://github.com/apache/hudi/pull/9123#discussion_r1257546507


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/ProvidesHoodieConfig.scala:
##
@@ -112,6 +113,36 @@ trait ProvidesHoodieConfig extends Logging {
 }
   }
 
+  private def deducePayloadClassNameLegacy(operation: String, tableType: 
String, insertMode: InsertMode): String = {
+if (operation == UPSERT_OPERATION_OPT_VAL &&
+  tableType == COW_TABLE_TYPE_OPT_VAL && insertMode == InsertMode.STRICT) {
+  // Validate duplicate key for COW, for MOR it will do the merge with the 
DefaultHoodieRecordPayload
+  // on reading.
+  // TODO use HoodieSparkValidateDuplicateKeyRecordMerger when 
SparkRecordMerger is default
+  classOf[ValidateDuplicateKeyPayload].getCanonicalName
+} else if (operation == INSERT_OPERATION_OPT_VAL && tableType == 
COW_TABLE_TYPE_OPT_VAL &&
+  insertMode == InsertMode.STRICT){
+  // Validate duplicate key for inserts to COW table when using strict 
insert mode.
+  classOf[ValidateDuplicateKeyPayload].getCanonicalName
+} else {
+  classOf[OverwriteWithLatestAvroPayload].getCanonicalName
+}

Review Comment:
   Yes, that's why I also did not switch.






[GitHub] [hudi] nsivabalan commented on a diff in pull request #9123: [HUDI-6478] Simplifying INSERT_INTO configs for spark-sql

2023-07-09 Thread via GitHub


nsivabalan commented on code in PR #9123:
URL: https://github.com/apache/hudi/pull/9123#discussion_r1257546160


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/ProvidesHoodieConfig.scala:
##
@@ -187,29 +218,38 @@ trait ProvidesHoodieConfig extends Logging {
 
 val insertModeOpt = combinedOpts.get(SQL_INSERT_MODE.key)
 val insertModeSet = insertModeOpt.nonEmpty
+val sqlWriteOperationOpt = combinedOpts.get(SQL_WRITE_OPERATION.key())
+val sqlWriteOperationSet = sqlWriteOperationOpt.nonEmpty
+val sqlWriteOperation = 
sqlWriteOperationOpt.getOrElse(SQL_WRITE_OPERATION.defaultValue())
+val insertDupPolicy = combinedOpts.getOrElse(INSERT_DUP_POLICY.key(), 
INSERT_DUP_POLICY.defaultValue())
 val insertMode = 
InsertMode.of(insertModeOpt.getOrElse(SQL_INSERT_MODE.defaultValue()))
 val isNonStrictMode = insertMode == InsertMode.NON_STRICT
 val isPartitionedTable = hoodieCatalogTable.partitionFields.nonEmpty
 val combineBeforeInsert = hoodieCatalogTable.preCombineKey.nonEmpty && 
hoodieCatalogTable.primaryKeys.nonEmpty
 
-// NOTE: Target operation could be overridden by the user, therefore if it 
has been provided as an input
-//   we'd prefer that value over auto-deduced operation. Otherwise, we 
deduce target operation type
+// try to use sql write operation instead of legacy insert mode. If only 
insert mode is explicitly specified, we will uze
+// o
+val useLegacyInsertModeFlow = insertModeSet && !sqlWriteOperationSet

Review Comment:
   Nope, we should call it out in our docs that with Spark SQL INSERT_INTO, users are expected to set this new config. If DataSourceWriteOptions.OPERATION is set instead, it may not be honored.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a diff in pull request #9123: [HUDI-6478] Simplifying INSERT_INTO configs for spark-sql

2023-07-09 Thread via GitHub


nsivabalan commented on code in PR #9123:
URL: https://github.com/apache/hudi/pull/9123#discussion_r1257545983


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala:
##
@@ -514,6 +520,29 @@ object DataSourceWriteOptions {
 
   val RECONCILE_SCHEMA: ConfigProperty[java.lang.Boolean] = 
HoodieCommonConfig.RECONCILE_SCHEMA
 
+  val SQL_WRITE_OPERATION: ConfigProperty[String] = ConfigProperty
+.key("hoodie.sql.write.operation")
+.defaultValue("insert")

Review Comment:
   this is mainly used w/ INSERT_INTO. So, delete does not make sense. 



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala:
##
@@ -514,6 +520,29 @@ object DataSourceWriteOptions {
 
   val RECONCILE_SCHEMA: ConfigProperty[java.lang.Boolean] = 
HoodieCommonConfig.RECONCILE_SCHEMA
 
+  val SQL_WRITE_OPERATION: ConfigProperty[String] = ConfigProperty
+.key("hoodie.sql.write.operation")
+.defaultValue("insert")
+.withDocumentation("Sql write operation to use with INSERT_INTO spark sql 
command. This comes with 3 possible values, bulk_insert, " +
+  "insert and upsert. bulk_insert is generally meant for initial loads and 
is known to be performant compared to insert. But bulk_insert may not " +
+  "do small file managmeent. If you prefer hudi to automatically managee 
small files, then you can go with \"insert\". There is no precombine " +
+  "(if there are duplicates within the same batch being ingested, same 
dups will be ingested) with bulk_insert and insert and there is no index " +
+  "look up as well. If you may use INSERT_INTO for mutable dataset, then 
you may have to set this config value to \"upsert\". With upsert, you will " +
+  "get both precombine and updates to existing records on storage is also 
honored. If not, you may see duplicates. ")
+
+  val NONE_INSERT_DUP_POLICY = "none"
+  val DROP_INSERT_DUP_POLICY = "drop"
+  val FAIL_INSERT_DUP_POLICY = "fail"
+
+  val INSERT_DUP_POLICY: ConfigProperty[String] = ConfigProperty
+.key("hoodie.datasource.insert.dup.policy")
+.defaultValue(NONE_INSERT_DUP_POLICY)

Review Comment:
   sure
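   
   A hedged companion sketch for the duplicate policy discussed above, assuming the `hoodie.datasource.insert.dup.policy` key and its none/drop/fail values from this PR; `hudi_tbl` is a hypothetical table with a record key:
   
   ```python
   # Sketch only, assuming the `hoodie.datasource.insert.dup.policy` key and the
   # none / drop / fail values proposed in this PR. `hudi_tbl` is a hypothetical
   # Hudi table with a record key, in a session already configured for Hudi.
   from pyspark.sql import SparkSession
   
   spark = SparkSession.builder.appName("dup-policy-sketch").getOrCreate()
   
   # none (default): incoming duplicates are written as-is.
   # drop: incoming records whose key already exists are silently dropped.
   # fail: the write fails if an incoming key already exists in the table.
   spark.sql("SET hoodie.datasource.insert.dup.policy=drop")
   spark.sql("INSERT INTO hudi_tbl VALUES (1, 'a1', 10.0, 1000)")
   ```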



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[hudi] branch master updated: [HUDI-6443] Support delete_partition, insert_overwrite/table with record-level index (#9055)

2023-07-09 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 6d3311c34c8 [HUDI-6443] Support delete_partition, 
insert_overwrite/table with record-level index (#9055)
6d3311c34c8 is described below

commit 6d3311c34c8548339aee8c3708fe50855a04afb8
Author: Shiyan Xu <2701446+xushi...@users.noreply.github.com>
AuthorDate: Mon Jul 10 01:24:44 2023 +0800

[HUDI-6443] Support delete_partition, insert_overwrite/table with 
record-level index (#9055)

- Support delete_partition, insert_overwrite and insert_overwrite_table 
with record-level index. The metadata records should be updated accordingly.

- all records in the deleted partition(s) should be deleted from RLI (for the delete_partition operation)
- newly inserted records should be present in RLI
- old records in the affected partitions should be removed from RLI
- old records that happen to have the same record key as the new inserts won't be removed from RLI; their entries will be updated
-

Co-authored-by: sivabalan 
---
 .../metadata/HoodieBackedTableMetadataWriter.java  | 65 ++
 .../hudi/functional/TestRecordLevelIndex.scala | 28 +-
 2 files changed, 80 insertions(+), 13 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
index 97d1ce5e8b2..df73145a1bb 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
@@ -42,7 +42,9 @@ import org.apache.hudi.common.model.HoodiePartitionMetadata;
 import org.apache.hudi.common.model.HoodieRecord;
 import org.apache.hudi.common.model.HoodieRecordDelegate;
 import org.apache.hudi.common.model.HoodieRecordLocation;
+import org.apache.hudi.common.model.HoodieReplaceCommitMetadata;
 import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.model.WriteOperationType;
 import org.apache.hudi.common.table.HoodieTableMetaClient;
 import org.apache.hudi.common.table.log.HoodieLogFormat;
 import org.apache.hudi.common.table.log.block.HoodieDeleteBlock;
@@ -477,7 +479,7 @@ public abstract class HoodieBackedTableMetadataWriter 
implements HoodieTableMeta
 + partitions.size() + " partitions");
 
 // Collect record keys from the files in parallel
-HoodieData records = 
readRecordKeysFromBaseFiles(engineContext, partitionBaseFilePairs);
+HoodieData records = 
readRecordKeysFromBaseFiles(engineContext, partitionBaseFilePairs, false);
 records.persist("MEMORY_AND_DISK_SER");
 final long recordCount = records.count();
 
@@ -495,7 +497,8 @@ public abstract class HoodieBackedTableMetadataWriter 
implements HoodieTableMeta
* Read the record keys from base files in partitions and return records.
*/
   private HoodieData 
readRecordKeysFromBaseFiles(HoodieEngineContext engineContext,
-   
List> partitionBaseFilePairs) {
+   
List> partitionBaseFilePairs,
+   boolean 
forDelete) {
 if (partitionBaseFilePairs.isEmpty()) {
   return engineContext.emptyHoodieData();
 }
@@ -524,8 +527,9 @@ public abstract class HoodieBackedTableMetadataWriter 
implements HoodieTableMeta
 
 @Override
 public HoodieRecord next() {
-  return 
HoodieMetadataPayload.createRecordIndexUpdate(recordKeyIterator.next(), 
partition, fileId,
-  instantTime);
+  return forDelete
+  ? 
HoodieMetadataPayload.createRecordIndexDelete(recordKeyIterator.next())
+  : 
HoodieMetadataPayload.createRecordIndexUpdate(recordKeyIterator.next(), 
partition, fileId, instantTime);
 }
   };
 });
@@ -872,9 +876,10 @@ public abstract class HoodieBackedTableMetadataWriter 
implements HoodieTableMeta
 
   // Updates for record index are created by parsing the WriteStatus which 
is a hudi-client object. Hence, we cannot yet move this code
   // to the HoodieTableMetadataUtil class in hudi-common.
-  if (writeStatus != null && !writeStatus.isEmpty()) {
-partitionToRecordMap.put(MetadataPartitionType.RECORD_INDEX, 
getRecordIndexUpdates(writeStatus));
-  }
+  HoodieData updatesFromWriteStatuses = 
getRecordIndexUpdates(writeStatus);
+  HoodieData additionalUpdates = 
getRecordIndexAdditionalUpdates(updatesFromWriteStatuses, commitMetadata);
+  partitionToRecordMap.put(MetadataPartitionType.RECORD_INDEX, 

[GitHub] [hudi] nsivabalan merged pull request #9055: [HUDI-6443] Support delete_partition, insert_overwrite/table with record-level index

2023-07-09 Thread via GitHub


nsivabalan merged PR #9055:
URL: https://github.com/apache/hudi/pull/9055


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] huyuanfeng2018 commented on issue #8265: [SUPPORT] Flink Table planner not loading problem

2023-07-09 Thread via GitHub


huyuanfeng2018 commented on issue #8265:
URL: https://github.com/apache/hudi/issues/8265#issuecomment-1627766770

   > > flink-table-planner has been removed from the classpath since flink1.15
   > 
   > Then how are the classes in flink-table-planner loaded, and can we do similar loading on the Hudi side?
   
   This logic is executed when the Flink graph is built, not in the JobManager (except in application mode). The classpath running Hudi-related logic does not load the table planner by default: since Flink 1.15, flink-table-planner-loader loads flink-table-planner dynamically at runtime (because the table planner is not necessarily bound to Blink). Hudi, however, binds to Blink's sort implementation in its code, so I think Hudi may need to copy the SortCodeGenerator-related logic (I am not sure).
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] jlloh commented on issue #8191: [BUG] Unable to execute HTTP request | connection timeout issues

2023-07-09 Thread via GitHub


jlloh commented on issue #8191:
URL: https://github.com/apache/hudi/issues/8191#issuecomment-1627759070

   Seeing something similar on Flink 1.16 with Hudi 0.13.1: COW insert with the metadata table enabled. The problem seems to occur ~4 hours after the job has been running. The job uses inline clustering. After disabling the metadata table, the job is able to proceed.
   
   Configurations:
   ```
   "table.table": "COPY_ON_WRITE"
   "write.operation": "insert"
   "write.insert.cluster": "true"
   "hoodie.datasource.write.hive_style_partitioning": "true"
   "metadata.enabled": "true"
   "hoodie.datasource.write.hive_style_partitioning": "true"
   "hoodie.parquet.max.file.size": "104857600"
   "hoodie.parquet.small.file.limit": "20971520"
   "clustering.plan.strategy.small.file.limit": "100"
   ```
   Files:
   ~211 parquet files per partition across 4 hourly partitions when the issue started happening and the job failed to continue. The bucket assigner task is the one that hits this error. I have tried both hourly and daily partitions, but both jobs seem to eventually fail and are not able to recover with the metadata table enabled.
   
   Full stacktrace:
   ```
   org.apache.hudi.exception.HoodieMetadataException: Failed to retrieve files 
in partition 
s3a:///folder_name/local_year=2023/local_month=07/local_day=08 
from metadata
at 
org.apache.hudi.metadata.BaseTableMetadata.getAllFilesInPartition(BaseTableMetadata.java:152)
at 
org.apache.hudi.metadata.HoodieMetadataFileSystemView.listPartition(HoodieMetadataFileSystemView.java:69)
at 
org.apache.hudi.common.table.view.AbstractTableFileSystemView.lambda$ensurePartitionLoadedCorrectly$16(AbstractTableFileSystemView.java:432)
at 
java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1660)
at 
org.apache.hudi.common.table.view.AbstractTableFileSystemView.ensurePartitionLoadedCorrectly(AbstractTableFileSystemView.java:423)
at 
org.apache.hudi.common.table.view.AbstractTableFileSystemView.getLatestBaseFilesBeforeOrOn(AbstractTableFileSystemView.java:660)
at 
org.apache.hudi.common.table.view.PriorityBasedFileSystemView.execute(PriorityBasedFileSystemView.java:104)
at 
org.apache.hudi.common.table.view.PriorityBasedFileSystemView.getLatestBaseFilesBeforeOrOn(PriorityBasedFileSystemView.java:145)
at 
org.apache.hudi.sink.partitioner.profile.WriteProfile.smallFilesProfile(WriteProfile.java:208)
at 
org.apache.hudi.sink.partitioner.profile.WriteProfile.getSmallFiles(WriteProfile.java:191)
at 
org.apache.hudi.sink.partitioner.BucketAssigner.getSmallFileAssign(BucketAssigner.java:179)
at 
org.apache.hudi.sink.partitioner.BucketAssigner.addInsert(BucketAssigner.java:137)
at 
org.apache.hudi.sink.partitioner.BucketAssignFunction.getNewRecordLocation(BucketAssignFunction.java:215)
at 
org.apache.hudi.sink.partitioner.BucketAssignFunction.processRecord(BucketAssignFunction.java:200)
at 
org.apache.hudi.sink.partitioner.BucketAssignFunction.processElement(BucketAssignFunction.java:162)
at 
org.apache.flink.streaming.api.operators.KeyedProcessOperator.processElement(KeyedProcessOperator.java:83)
at 
org.apache.flink.streaming.runtime.tasks.OneInputStreamTask$StreamTaskNetworkOutput.emitRecord(OneInputStreamTask.java:233)
at 
org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.processElement(AbstractStreamTaskNetworkInput.java:134)
at 
org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:105)
at 
org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:542)
at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:231)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:831)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:780)
at 
org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:935)
at 
org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:914)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:728)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:550)
at java.lang.Thread.run(Thread.java:750)
   Caused by: org.apache.hudi.exception.HoodieException: Exception when reading 
log file 
at 
org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternalV1(AbstractHoodieLogRecordReader.java:374)
at 
org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternal(AbstractHoodieLogRecordReader.java:223)
at 

[jira] [Updated] (HUDI-6515) Bucket index row writer write record to wrong handle

2023-07-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-6515:
-
Labels: pull-request-available  (was: )

> Bucket index row writer write record to wrong handle
> 
>
> Key: HUDI-6515
> URL: https://issues.apache.org/jira/browse/HUDI-6515
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Qijun Fu
>Assignee: Qijun Fu
>Priority: Critical
>  Labels: pull-request-available
>
> Bucket index row writer write record to wrong handle.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #9156: [HUDI-6515] Fix bucket index row writer write record to wrong handle

2023-07-09 Thread via GitHub


hudi-bot commented on PR #9156:
URL: https://github.com/apache/hudi/pull/9156#issuecomment-1627738046

   
   ## CI report:
   
   * 4416219014959fe02290bf41695abb70c31c1715 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18429)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Closed] (HUDI-5759) Hudi do not support add column on mor table with log

2023-07-09 Thread Qijun Fu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qijun Fu closed HUDI-5759.
--
Fix Version/s: 0.13.0
   Resolution: Fixed

> Hudi do not support add column on mor table with log
> 
>
> Key: HUDI-5759
> URL: https://issues.apache.org/jira/browse/HUDI-5759
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Reporter: Qijun Fu
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.0, 0.13.0
>
>
> We tested the following SQL statements on the latest master branch:
> ```sql
> create table h0 (
>   id int,
>   name string,
>   price double,
>   ts long
> ) using hudi
>  options (
>   primaryKey ='id',
>   type = 'mor',
>   preCombineField = 'ts'
>  )
>  partitioned by(ts)
>  location '/tmp/h0';
> insert into h0 select 1, 'a1', 10, 1000;
> update h0 set price = 20 where id = 1;
> alter table h0 add column new_col1 int;
> update h0 set price = 22 where id = 1;
> select * from h0;
> ```
> And found that we can't read the table after adding a column and updating.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-6336) Support TimelineBased Checkpoint Metadata for flink

2023-07-09 Thread Qijun Fu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qijun Fu reassigned HUDI-6336:
--

Assignee: Qijun Fu

> Support TimelineBased Checkpoint Metadata for flink
> ---
>
> Key: HUDI-6336
> URL: https://issues.apache.org/jira/browse/HUDI-6336
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: flink
>Reporter: Qijun Fu
>Assignee: Qijun Fu
>Priority: Major
>
> To reduce OSS list QPS in Flink writes, we can use the timeline server to serve
> the checkpoint (ckp) metadata requests



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-6337) Incremental Clean skip fetch commit metadata for append mode

2023-07-09 Thread Qijun Fu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qijun Fu closed HUDI-6337.
--
Resolution: Won't Fix

> Incremental Clean skip fetch commit metadata for append mode 
> -
>
> Key: HUDI-6337
> URL: https://issues.apache.org/jira/browse/HUDI-6337
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: cleaning
>Reporter: Qijun Fu
>Priority: Major
>  Labels: pull-request-available
>
> Incremental Clean skip fetch commit metadata for append mode. 
>  
> In append mode we don't need to clean the data files at all because there is
> always only one version per file group. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-6515) Bucket index row writer write record to wrong handle

2023-07-09 Thread Qijun Fu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qijun Fu reassigned HUDI-6515:
--

Assignee: Qijun Fu

> Bucket index row writer write record to wrong handle
> 
>
> Key: HUDI-6515
> URL: https://issues.apache.org/jira/browse/HUDI-6515
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Qijun Fu
>Assignee: Qijun Fu
>Priority: Critical
>
> Bucket index row writer write record to wrong handle.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #9156: [Hudi-6515] Fix bucket index row writer write record to wrong handle

2023-07-09 Thread via GitHub


hudi-bot commented on PR #9156:
URL: https://github.com/apache/hudi/pull/9156#issuecomment-1627697564

   
   ## CI report:
   
   * 4416219014959fe02290bf41695abb70c31c1715 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18429)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9156: [Hudi-6515] Fix bucket index row writer write record to wrong handle

2023-07-09 Thread via GitHub


hudi-bot commented on PR #9156:
URL: https://github.com/apache/hudi/pull/9156#issuecomment-1627695267

   
   ## CI report:
   
   * 4416219014959fe02290bf41695abb70c31c1715 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] stream2000 commented on a diff in pull request #9156: [Hudi-6515] Fix bucket index row writer write record to wrong handle

2023-07-09 Thread via GitHub


stream2000 commented on code in PR #9156:
URL: https://github.com/apache/hudi/pull/9156#discussion_r1257466200


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestInsertTable.scala:
##
@@ -1247,6 +1247,27 @@ class TestInsertTable extends HoodieSparkSqlTestBase {
   Seq(3, "a3,3", 30.0, 3000, "2021-01-07"),
   Seq(3, "a3", 30.0, 3000, "2021-01-07")
 )
+
+// there are two files in partition(dt = '2021-01-05')
+checkAnswer(s"select count(distinct _hoodie_file_name) from 
$tableName where dt = '2021-01-05'")(
+  Seq(2)
+)
+
+// would generate 6 other files in partition(dt = '2021-01-05')
+spark.sql(
+  s"""
+ | insert into $tableName values
+ | (4, 'a1,1', 10, 1000, "2021-01-05"),
+ | (5, 'a1,1', 10, 1000, "2021-01-05"),
+ | (6, 'a1,1', 10, 1000, "2021-01-05"),
+ | (7, 'a1,1', 10, 1000, "2021-01-05"),
+ | (8, 'a1,1', 10, 1000, "2021-01-05"),
+ | (9, 'a3,3', 30, 3000, "2021-01-05")
+   """.stripMargin)
+
+checkAnswer(s"select count(distinct _hoodie_file_name) from 
$tableName where dt = '2021-01-05'")(
+  Seq(8)

Review Comment:
   The result will be 3 without this change. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] stream2000 commented on a diff in pull request #9156: [Hudi-6515] Fix bucket index row writer write record to wrong handle

2023-07-09 Thread via GitHub


stream2000 commented on code in PR #9156:
URL: https://github.com/apache/hudi/pull/9156#discussion_r1257466066


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BucketBulkInsertDataInternalWriterHelper.java:
##
@@ -91,27 +90,20 @@ private UTF8String extractRecordKey(InternalRow row) {
 }
   }
 
-  protected HoodieRowCreateHandle getBucketRowCreateHandle(String fileId, int 
bucketId) throws Exception {
-if (!handles.containsKey(fileId)) { // if there is no handle corresponding 
to the fileId

Review Comment:
   The fileId passed here is actually the partitionPath, so `handles.containsKey(fileId)` is always true for records in the same partition, and thus no new handle will be created. 
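   
   A tiny illustration of the keying problem described above; this is not the actual handle code, and the names are made up:
   
   ```python
   # Illustration only (not the real Java handle code): if the handle map is keyed
   # by partition path instead of the bucket file id, every record of a partition
   # reuses whichever handle was created first.
   handles = {}
   
   def get_handle(partition_path, file_id):
       # Buggy lookup: keyed by partition_path, so the handle created for the
       # first bucket is returned for every other bucket of the same partition.
       if partition_path not in handles:
           handles[partition_path] = "handle-for-" + file_id
       return handles[partition_path]
   
   print(get_handle("dt=2021-01-05", "bucket-0"))  # handle-for-bucket-0
   print(get_handle("dt=2021-01-05", "bucket-3"))  # handle-for-bucket-0 again (wrong bucket)
   ```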



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] stream2000 opened a new pull request, #9156: [Hudi-6515] Fix bucket index row writer

2023-07-09 Thread via GitHub


stream2000 opened a new pull request, #9156:
URL: https://github.com/apache/hudi/pull/9156

   ### Change Logs
   
   Fix the bucket index row writer writing records to the wrong handle, and remove some unused code since the records within a partition are sorted.
   
   ### Impact
   
   Fix the bucket index row writer writing records to the wrong handle
   
   ### Risk level (write none, low medium or high below)
   
   medium
   
   ### Documentation Update
   
   NONE
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Change Logs and Impact were stated clearly
   - [x] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-6515) Bucket index row writer write record to wrong handle

2023-07-09 Thread Qijun Fu (Jira)
Qijun Fu created HUDI-6515:
--

 Summary: Bucket index row writer write record to wrong handle
 Key: HUDI-6515
 URL: https://issues.apache.org/jira/browse/HUDI-6515
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Qijun Fu


Bucket index row writer write record to wrong handle.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] parisni commented on issue #9026: [SUPPORT] Duplicated partitions rows in MDT when reading w/ datasource

2023-07-09 Thread via GitHub


parisni commented on issue #9026:
URL: https://github.com/apache/hudi/issues/9026#issuecomment-1627681838

   @yihua 
   >  If the metadata table is queried through Spark datasource directly after 
MDT compaction (i.e., no additional log file in the latest file slice), there 
is no duplicate.
   
   Did you add new partitions during that step? It turns out the duplication occurs when new partitions are added after compaction; see below: with no new partitions there is no duplication, but with new partitions it gets tons of duplicates. 
   ```python
   sc.setLogLevel("ERROR")
   tableName = 'test_corrupted_mdt'
   basePath = "/tmp/{tableName}".format(tableName=tableName)
   
   hudi_options = {
   "hoodie.table.name": tableName,
   "hoodie.datasource.write.recordkey.field": "event_id",
   "hoodie.datasource.write.partitionpath.field": "part",
   "hoodie.datasource.write.table.name": tableName,
   "hoodie.datasource.write.operation": "upsert",
   "hoodie.datasource.write.precombine.field": "ts",
   "hoodie.datasource.write.keygenerator.class": 
"org.apache.hudi.keygen.ComplexKeyGenerator",
   "hoodie.datasource.write.hive_style_partitioning": "true",
   "hoodie.datasource.hive_sync.enable": "false",
   "hoodie.metadata.enable": "true",
   }
   mode="overwrite"
   for i in range(1,22):
   df =spark.sql("select '1' as event_id, '2' as ts, '"+str(i)+"' as part") 
# <-- W/ adding new partitions
   # df =spark.sql("select '1' as event_id, '2' as ts, '2' as part") <-- 
W/O adding new partitions
   
(df.write.format("hudi").options(**hudi_options).mode(mode).save(basePath))
   mode="append"
   ct = spark.read.format("hudi").load(basePath + 
"/.hoodie/metadata").count()
   print("NB:"+str(ct) + " for iteration:" + str(i))
   
   
   NB:2 for iteration:1
   NB:3 for iteration:2
   NB:4 for iteration:3
   NB:5 for iteration:4
   NB:6 for iteration:5
   NB:7 for iteration:6
   NB:8 for iteration:7
   NB:9 for iteration:8
   NB:10 for iteration:9
   NB:21 for iteration:10  <--- MDT COMPACTION
   NB:32 for iteration:11
   NB:43 for iteration:12
   NB:54 for iteration:13
   NB:65 for iteration:14
   NB:76 for iteration:15
   NB:87 for iteration:16
   NB:98 for iteration:17
   NB:109 for iteration:18
   NB:120 for iteration:19
   NB:41 for iteration:20  <--- MDT COMPACTION
   NB:62 for iteration:21
   
   
   NB:2 for iteration:1
   NB:2 for iteration:2
   NB:2 for iteration:3
   NB:2 for iteration:4
   NB:2 for iteration:5
   NB:2 for iteration:6
   NB:2 for iteration:7
   NB:2 for iteration:8
   NB:2 for iteration:9
   NB:2 for iteration:10  <--- MDT COMPACTION
   NB:2 for iteration:11
   NB:2 for iteration:12
   NB:2 for iteration:13
   NB:2 for iteration:14
   NB:2 for iteration:15
   NB:2 for iteration:16
   NB:2 for iteration:17
   NB:2 for iteration:18
   NB:2 for iteration:19
   NB:2 for iteration:20  <--- MDT COMPACTION
   NB:2 for iteration:21
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] garyli1019 commented on a diff in pull request #8679: [RFC-69] Hudi 1.X

2023-07-09 Thread via GitHub


garyli1019 commented on code in PR #8679:
URL: https://github.com/apache/hudi/pull/8679#discussion_r1257449903


##
rfc/rfc-69/rfc-69.md:
##
@@ -0,0 +1,159 @@
+
+# RFC-69: Hudi 1.X
+
+## Proposers
+
+* Vinoth Chandar
+
+## Approvers
+
+*   Hudi PMC
+
+## Status
+
+Under Review
+
+## Abstract
+
+This RFC proposes an exciting and powerful re-imagination of the transactional 
database layer in Hudi to power continued innovation across the community in 
the coming years. We have 
[grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi)
 more than 6x contributors in the past few years, and this RFC serves as the 
perfect opportunity to clarify and align the community around a core vision. 
This RFC aims to serve as a starting point for this discussion, then solicit 
feedback, embrace new ideas and collaboratively build consensus towards an 
impactful Hudi 1.X vision, then distill down what constitutes the first release 
- Hudi 1.0.
+
+## **State of the Project**
+
+As many of you know, Hudi was originally created at Uber in 2016 to solve 
[large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) 
and [incremental data 
processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems 
and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. 
+Since its graduation as a top-level Apache project in 2020, the community has 
made impressive progress toward the [streaming data lake 
vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) 
to make data lakes more real-time and efficient with incremental processing on 
top of a robust set of platform components. 
+The most recent 0.13 brought together several notable features to empower 
incremental data pipelines, including - [_RFC-51 Change Data 
Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), 
more advanced indexing techniques like [_consistent hash 
indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and 
+novel innovations like [_early conflict 
detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - 
to name a few.
+
+
+
+Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve 
end-end use cases using Hudi as a data lake platform that delivers a 
significant amount of automation on top of an interoperable open storage 
format. 
+Users can ingest incrementally from files/streaming systems/databases and 
insert/update/delete that data into Hudi tables, with a wide selection of 
performant indexes. 
+Thanks to the core design choices like record-level metadata and 
incremental/CDC queries, users are able to consistently chain the ingested data 
into downstream pipelines, with the help of strong stream processing support in 
+recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. 
Hudi's table services automatically kick in across this ingested and derived 
data to manage different aspects of table bookkeeping, metadata and storage 
layout. 
+Finally, Hudi's broad support for different catalogs and wide integration 
across various query engines mean Hudi tables can also be "batch" processed 
old-school style or accessed from interactive query engines.
+
+## **Future Opportunities**
+
+We have been adding new capabilities in the 0.x release line, but we can also 
turn the core of Hudi into a more general-purpose database experience for the 
lake. As the first kid on the lakehouse block (we called it "transactional data 
lakes" or "streaming data lakes" 
+to speak the warehouse users' and data engineers' languages, respectively), we 
made some conservative choices based on the ecosystem at that time. However, 
revisiting those choices is important to see if they still hold up.
+
+*   **Deep Query Engine Integrations:** Back then, query engines like Presto, 
Spark, Flink, Trino and Hive were getting good at queries on columnar data 
files but painfully hard to integrate into. Over time, we expected clear API 
abstractions 
+around indexing/metadata/table snapshots in the parquet/orc read paths that a 
project like Hudi can tap into to easily leverage innovations like 
Velox/PrestoDB. However, most engines preferred a separate integration - 
leading to Hudi maintaining its own Spark Datasource, 
+Presto and Trino connectors. However, this now opens up the opportunity to 
fully leverage Hudi's multi-modal indexing capabilities during query planning 
and execution.
+*   **Generalized Data Model:** While Hudi supported keys, we focused on 
updating Hudi tables as if they were a key-value store, while SQL queries ran 
on top, blissfully unchanged and unaware. Back then, generalizing the support 
for 
+keys felt premature based on where the ecosystem was, which was still doing 
large batch M/R jobs. Today, more performant, advanced engines like Apache 
Spark and Apache Flink have mature extensible SQL support that can support a 
generalized, 
+relational 

[jira] [Closed] (HUDI-5861) table can not read date after overwrite table with bucket index

2023-07-09 Thread KnightChess (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

KnightChess closed HUDI-5861.
-
Resolution: Won't Fix

> table can not read date after overwrite table with bucket index
> ---
>
> Key: HUDI-5861
> URL: https://issues.apache.org/jira/browse/HUDI-5861
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: core
>Reporter: KnightChess
>Assignee: KnightChess
>Priority: Major
>  Labels: pull-request-available
>
> For a table with a bucket index, if you `insert overwrite` the same partition twice, 
> the `query sql` cannot read any data back.
> reason: insert overwrite with a bucket index will tagLocation, and the new file 
> version will use the same fgId, so the latest file slice will be filtered out by the 
> `replacecommit` file with the same fileGroupId.
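
A hedged spark-sql repro sketch of the scenario above; the bucket-index table properties and schema are assumptions for illustration, not copied from the original report:

```python
# Hedged repro sketch. Assumes a session launched with the Hudi bundle and SQL
# extensions; table name, schema and bucket options below are illustrative only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bucket-overwrite-repro").getOrCreate()

spark.sql("""
  CREATE TABLE bucket_tbl (id INT, name STRING, dt STRING)
  USING hudi
  PARTITIONED BY (dt)
  TBLPROPERTIES (
    primaryKey = 'id',
    type = 'cow',
    'hoodie.index.type' = 'BUCKET',
    'hoodie.bucket.index.num.buckets' = '4'
  )
""")

# Overwrite the same partition twice, as described in the report.
spark.sql("INSERT OVERWRITE TABLE bucket_tbl PARTITION (dt='2023-07-09') SELECT 1, 'a1'")
spark.sql("INSERT OVERWRITE TABLE bucket_tbl PARTITION (dt='2023-07-09') SELECT 1, 'a1'")

# Per the report, the second overwrite leaves the partition unreadable (no rows).
spark.sql("SELECT * FROM bucket_tbl WHERE dt = '2023-07-09'").show()
```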



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-5091) MergeInto syntax merge_condition does not support Non-Equal

2023-07-09 Thread KnightChess (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

KnightChess closed HUDI-5091.
-
Fix Version/s: 0.14.0
   Resolution: Won't Fix

> MergeInto syntax merge_condition does not support Non-Equal
> ---
>
> Key: HUDI-5091
> URL: https://issues.apache.org/jira/browse/HUDI-5091
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql
>Reporter: KnightChess
>Assignee: KnightChess
>Priority: Major
> Fix For: 0.14.0
>
>
> Merge into sql merge condition support Non-equal condition
> https://github.com/apache/hudi/issues/6400



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HUDI-5091) MergeInto syntax merge_condition does not support Non-Equal

2023-07-09 Thread KnightChess (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

KnightChess resolved HUDI-5091.
---

> MergeInto syntax merge_condition does not support Non-Equal
> ---
>
> Key: HUDI-5091
> URL: https://issues.apache.org/jira/browse/HUDI-5091
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql
>Reporter: KnightChess
>Assignee: KnightChess
>Priority: Major
>
> Merge into sql merge condition support Non-equal condition
> https://github.com/apache/hudi/issues/6400



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-6514) Remove hoodie.auto.commit

2023-07-09 Thread zouxxyy (Jira)
zouxxyy created HUDI-6514:
-

 Summary: Remove hoodie.auto.commit
 Key: HUDI-6514
 URL: https://issues.apache.org/jira/browse/HUDI-6514
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: zouxxyy


The usage of `hoodie.auto.commit` is very confusing; the meaning of this configuration is not clear at all.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] yihua commented on issue #9026: [SUPPORT] Duplicated partitions rows in MDT when reading w/ datasource

2023-07-09 Thread via GitHub


yihua commented on issue #9026:
URL: https://github.com/apache/hudi/issues/9026#issuecomment-1627630057

   Based on my investigation, it looks like the HFile does not contain duplicate entries.  If the metadata table is queried through the Spark datasource directly after MDT compaction (i.e., with no additional log file in the latest file slice), there are no duplicates.  The duplicates seem to be related to the Spark datasource read on the MDT only, not affecting the MDT read through the `HoodieBackedTableMetadata` APIs.
   
   The performance is a separate issue we need to look into.
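   
   A hedged way to check that datasource read path for duplicates, reusing the layout from the repro script earlier in this thread; the `key` column name is an assumption about how the MDT schema surfaces through the datasource:
   
   ```python
   # Hedged duplicate check on the Spark datasource read path of the metadata table.
   # Assumes a session configured for Hudi and the base path used by the repro
   # script earlier in this thread; `key` is assumed to be the record-key field.
   from pyspark.sql import SparkSession
   
   spark = SparkSession.builder.appName("mdt-dup-check").getOrCreate()
   basePath = "/tmp/test_corrupted_mdt"
   
   mdt = spark.read.format("hudi").load(basePath + "/.hoodie/metadata")
   keys = mdt.select("key")
   total = keys.count()
   distinct = keys.distinct().count()
   print("rows=%d, distinct keys=%d, duplicates=%d" % (total, distinct, total - distinct))
   ```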


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org