Re: [PR] [HUDI-7146] Integrate secondary index on reader path [hudi]

2024-06-06 Thread via GitHub


KnightChess commented on code in PR #11162:
URL: https://github.com/apache/hudi/pull/11162#discussion_r1630650191


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/SecondaryIndexSupport.scala:
##
@@ -0,0 +1,130 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hudi.RecordLevelIndexSupport.filterQueryWithRecordKey
+import org.apache.hudi.SecondaryIndexSupport.filterQueriesWithSecondaryKey
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.model.FileSlice
+import org.apache.hudi.common.table.HoodieTableMetaClient
+import org.apache.hudi.metadata.HoodieTableMetadataUtil.PARTITION_NAME_SECONDARY_INDEX
+import org.apache.hudi.storage.StoragePath
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.expressions.Expression
+
+import scala.collection.JavaConverters._
+import scala.collection.{JavaConverters, mutable}
+
+class SecondaryIndexSupport(spark: SparkSession,
+                            metadataConfig: HoodieMetadataConfig,
+                            metaClient: HoodieTableMetaClient) extends RecordLevelIndexSupport(spark, metadataConfig, metaClient) {
+  override def getIndexName: String = SecondaryIndexSupport.INDEX_NAME
+
+  override def computeCandidateFileNames(fileIndex: HoodieFileIndex,
+                                         queryFilters: Seq[Expression],
+                                         queryReferencedColumns: Seq[String],
+                                         prunedPartitionsAndFileSlices: Seq[(Option[BaseHoodieTableFileIndex.PartitionPath], Seq[FileSlice])],
+                                         shouldPushDownFilesFilter: Boolean
+                                        ): Option[Set[String]] = {
+    val secondaryKeyConfigOpt = getSecondaryKeyConfig(queryReferencedColumns, metaClient)
+    if (secondaryKeyConfigOpt.isEmpty) {
+      Option.empty

Review Comment:
   return Option.empty



Re: [PR] [HUDI-7146] Integrate secondary index on reader path [hudi]

2024-06-06 Thread via GitHub


hudi-bot commented on PR #11162:
URL: https://github.com/apache/hudi/pull/11162#issuecomment-2154087365

   
   ## CI report:
   
   * b342d8f8e10f77419bf1bd0bc9f626a596ad65f9 UNKNOWN
   * fe352bc8ce928773ff2c997c58b3ae1c07ae9ba9 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24249)
 
   * db219f61a2efd1b601ba1ec14f7e2871df5d615d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24265)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7146] Integrate secondary index on reader path [hudi]

2024-06-06 Thread via GitHub


hudi-bot commented on PR #11162:
URL: https://github.com/apache/hudi/pull/11162#issuecomment-2154064293

   
   ## CI report:
   
   * b342d8f8e10f77419bf1bd0bc9f626a596ad65f9 UNKNOWN
   * fe352bc8ce928773ff2c997c58b3ae1c07ae9ba9 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24249)
 
   * db219f61a2efd1b601ba1ec14f7e2871df5d615d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7146] Integrate secondary index on reader path [hudi]

2024-06-06 Thread via GitHub


codope commented on code in PR #11162:
URL: https://github.com/apache/hudi/pull/11162#discussion_r1630642711


##
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java:
##
@@ -851,6 +862,167 @@ private Map reverseLookupSecondaryKeys(String partitionName, Lis
     return recordKeyMap;
   }
 
+  @Override
+  protected Map<String, List<HoodieRecord<HoodieMetadataPayload>>> getSecondaryIndexRecords(List<String> keys, String partitionName) {
+    if (keys.isEmpty()) {
+      return Collections.emptyMap();
+    }
+
+    // Load the file slices for the partition. Each file slice is a shard which saves a portion of the keys.
+    List<FileSlice> partitionFileSlices = partitionFileSliceMap.computeIfAbsent(partitionName,
+        k -> HoodieTableMetadataUtil.getPartitionLatestMergedFileSlices(metadataMetaClient, metadataFileSystemView, partitionName));
+    final int numFileSlices = partitionFileSlices.size();
+    ValidationUtils.checkState(numFileSlices > 0, "Number of file slices for partition " + partitionName + " should be > 0");
+
+    engineContext.setJobStatus(this.getClass().getSimpleName(), "Lookup keys from each file slice");
+    HoodieData<FileSlice> partitionRDD = engineContext.parallelize(partitionFileSlices);
+    // Define the seqOp function (merges elements within a partition)
+    Functions.Function2<Map<String, List<HoodieRecord<HoodieMetadataPayload>>>, FileSlice, Map<String, List<HoodieRecord<HoodieMetadataPayload>>>> seqOp =
+        (accumulator, partition) -> {
+          Map<String, List<HoodieRecord<HoodieMetadataPayload>>> currentFileSliceResult = lookupSecondaryKeysFromFileSlice(partitionName, keys, partition);
+          currentFileSliceResult.forEach((secondaryKey, secondaryRecords) -> accumulator.merge(secondaryKey, secondaryRecords, (oldRecords, newRecords) -> {
+            newRecords.addAll(oldRecords);
+            return newRecords;
+          }));
+          return accumulator;
+        };
+    // Define the combOp function (merges elements across partitions)
+    Functions.Function2<Map<String, List<HoodieRecord<HoodieMetadataPayload>>>, Map<String, List<HoodieRecord<HoodieMetadataPayload>>>, Map<String, List<HoodieRecord<HoodieMetadataPayload>>>> combOp =
+        (map1, map2) -> {
+          map2.forEach((secondaryKey, secondaryRecords) -> map1.merge(secondaryKey, secondaryRecords, (oldRecords, newRecords) -> {
+            newRecords.addAll(oldRecords);
+            return newRecords;
+          }));
+          return map1;
+        };
+    // Use aggregate to merge results within and across partitions
+    // Define the zero value (initial value)
+    Map<String, List<HoodieRecord<HoodieMetadataPayload>>> zeroValue = new HashMap<>();
+    return engineContext.aggregate(partitionRDD, zeroValue, seqOp, combOp);
+  }
+
+  /**
+   * Lookup list of keys from a single file slice.
+   *
+   * @param partitionName Name of the partition
+   * @param secondaryKeys The list of secondary keys to lookup
+   * @param fileSlice     The file slice to read
+   * @return A {@code Map} of secondary-key to list of {@code HoodieRecord} for the secondary-keys which were found in the file slice
+   */
+  private Map<String, List<HoodieRecord<HoodieMetadataPayload>>> lookupSecondaryKeysFromFileSlice(String partitionName, List<String> secondaryKeys, FileSlice fileSlice) {
+    Map<String, Map<String, HoodieRecord>> logRecordsMap = new HashMap<>();
+
+    Pair<HoodieSeekingFileReader<?>, HoodieMetadataLogRecordReader> readers = getOrCreateReaders(partitionName, fileSlice);
+    try {
+      List<Long> timings = new ArrayList<>(1);
+      HoodieSeekingFileReader<?> baseFileReader = readers.getKey();
+      HoodieMetadataLogRecordReader logRecordScanner = readers.getRight();
+      if (baseFileReader == null && logRecordScanner == null) {
+        return Collections.emptyMap();
+      }
+
+      // Sort it here once so that we don't need to sort individually for base file and for each individual log files.
+      Set<String> secondaryKeySet = new HashSet<>(secondaryKeys.size());
+      List<String> sortedSecondaryKeys = new ArrayList<>(secondaryKeys);
+      Collections.sort(sortedSecondaryKeys);
+      secondaryKeySet.addAll(sortedSecondaryKeys);
+
+      logRecordScanner.getRecords().forEach(record -> {
+        HoodieMetadataPayload payload = record.getData();
+        String recordKey = payload.getRecordKeyFromSecondaryIndex();
+        if (secondaryKeySet.contains(recordKey)) {
+          String secondaryKey = payload.getRecordKeyFromSecondaryIndex();
+          logRecordsMap.computeIfAbsent(secondaryKey, k -> new HashMap<>()).put(recordKey, record);
+        }
+      });
+
+      return readNonUniqueRecordsAndMergeWithLogRecords(baseFileReader, sortedSecondaryKeys, logRecordsMap, timings, partitionName);
+    } catch (IOException ioe) {
+      throw new HoodieIOException("Error merging records from metadata table for  " + secondaryKeys.size() + " key : ", ioe);
+    } finally {
+      if (!reuse) {
+        closeReader(readers);
+      }
+    }
+  }
+
+  private Map<String, List<HoodieRecord<HoodieMetadataPayload>>> readNonUniqueRecordsAndMergeWithLogRecords(HoodieSeekingFileReader<?> reader,
+                                                                                                            List<String> sortedKeys,
+                                                                                                            Map<String, Map<String, HoodieRecord>> logRecordsMap,

Re: [PR] [HUDI-7840] Add position merging to fg reader [hudi]

2024-06-06 Thread via GitHub


hudi-bot commented on PR #11413:
URL: https://github.com/apache/hudi/pull/11413#issuecomment-2153925430

   
   ## CI report:
   
   * 1a1ca64bec2fb94acce596934dd636b77cb0aca7 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24264)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Remove calcite dependency [hudi]

2024-06-06 Thread via GitHub


hudi-bot commented on PR #11411:
URL: https://github.com/apache/hudi/pull/11411#issuecomment-2153902999

   
   ## CI report:
   
   * 0035194f60ef355eeaa21e237fb8bd68579f48cd Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24263)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7840] Add position merging to fg reader [hudi]

2024-06-06 Thread via GitHub


hudi-bot commented on PR #11413:
URL: https://github.com/apache/hudi/pull/11413#issuecomment-2153783515

   
   ## CI report:
   
   * 1a1ca64bec2fb94acce596934dd636b77cb0aca7 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24264)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7840] Add position merging to fg reader [hudi]

2024-06-06 Thread via GitHub


hudi-bot commented on PR #11413:
URL: https://github.com/apache/hudi/pull/11413#issuecomment-2153778377

   
   ## CI report:
   
   * 1a1ca64bec2fb94acce596934dd636b77cb0aca7 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-06-06 Thread via GitHub


hudi-bot commented on PR #10957:
URL: https://github.com/apache/hudi/pull/10957#issuecomment-2153777847

   
   ## CI report:
   
   * c98242b22fb2518c0cc93c037df558037030500f UNKNOWN
   * 04f69916d0bbc1bc8877a9fec5b38c40da1f7d46 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24262)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Closed] (HUDI-7428) Support Netease Object Storage protocol for Hudi

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo closed HUDI-7428.
---
Resolution: Fixed

> Support Netease Object Storage protocol for Hudi
> 
>
> Key: HUDI-7428
> URL: https://issues.apache.org/jira/browse/HUDI-7428
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: storage-management
>Reporter: xy
>Assignee: xy
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> Support Netease Object Storage protocol for Hudi
> https://sf.163.com/product/nos
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7427) Improve meta sync latency logging

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7427:

Fix Version/s: 0.15.0
   1.0.0

> Improve meta sync latency logging
> -
>
> Key: HUDI-7427
> URL: https://issues.apache.org/jira/browse/HUDI-7427
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7428) Support Netease Object Storage protocol for Hudi

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7428:

Fix Version/s: 0.15.0

> Support Netease Object Storage protocol for Hudi
> 
>
> Key: HUDI-7428
> URL: https://issues.apache.org/jira/browse/HUDI-7428
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: storage-management
>Reporter: xy
>Assignee: xy
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> Support Netease Object Storage protocol for Hudi
> https://sf.163.com/product/nos
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7427) Improve meta sync latency logging

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo closed HUDI-7427.
---
Resolution: Fixed

> Improve meta sync latency logging
> -
>
> Key: HUDI-7427
> URL: https://issues.apache.org/jira/browse/HUDI-7427
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7423) Support table type name incase-sensitive when create table in sparksql

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7423:

Fix Version/s: 0.15.0

> Support table type name incase-sensitive when create table in sparksql
> --
>
> Key: HUDI-7423
> URL: https://issues.apache.org/jira/browse/HUDI-7423
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql
>Reporter: xy
>Assignee: xy
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> Support case-insensitive table type names when creating a table in Spark SQL.
> Many users creating a table set table.type = MOR/Mor/Cow/COW according 
> to the Hudi documentation, like:
>  
> CREATE TABLE `hudi_test`.`hudi_test29` (
>   `app_id` STRING COMMENT 'application id',
>   `message_id` STRING COMMENT 'message id',
>   `dt` INT,
>   `from` STRING,
>   `day` STRING COMMENT 'date partition',
>   `hour` INT COMMENT 'hour partition'
> )using hudi
> tblproperties (
>   'primaryKey' = 'app_id,message_id',
>   'type' = 'MOR',
>   'preCombineField'='dt',
>   'hoodie.index.type' = 'BUCKET',
>   'hoodie.bucket.index.hash.field' = 'app_id,message_id',
>   'hoodie.bucket.index.num.buckets'=256,
>   'hoodie.datasource.hive_sync.table.strategy'='RT'
> )
> PARTITIONED BY (`day`,`hour`);
>  
> it would fail with an error like: 
> java.lang.IllegalArgumentException: No enum constant 
> org.apache.hudi.common.model.HoodieTableType.MOR
>     at java.lang.Enum.valueOf(Enum.java:238)
>     at 
> org.apache.hudi.common.model.HoodieTableType.valueOf(HoodieTableType.java:30)
>     at 
> org.apache.hudi.common.table.HoodieTableMetaClient$PropertyBuilder.setTableType(HoodieTableMetaClient.java:833)
>     at 
> org.apache.hudi.common.table.HoodieTableMetaClient$PropertyBuilder.fromProperties(HoodieTableMetaClient.java:1009)
>     at 
> org.apache.spark.sql.catalyst.catalog.HoodieCatalogTable.initHoodieTable(HoodieCatalogTable.scala:219)
>     at 
> org.apache.spark.sql.hudi.command.CreateHoodieTableCommand.run(CreateHoodieTableCommand.scala:70)
>     at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:80)
>     at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:78)
>     at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:89)
>     at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
>     at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
> It is not user-friendly to require users to set type=mor/cow exactly, so it is 
> better to make the config case-insensitive for users
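
A minimal sketch of the kind of case-insensitive normalization this asks for (illustration only; the enum below is a stand-in for org.apache.hudi.common.model.HoodieTableType, and this is not the actual fix in the pull request):

{code:java}
import java.util.Locale;

public class TableTypeNormalizationSketch {
  // Stand-in for org.apache.hudi.common.model.HoodieTableType, shown only to keep the example self-contained.
  enum HoodieTableType { COPY_ON_WRITE, MERGE_ON_READ }

  // Normalize the user-supplied 'type' table property before mapping it to the enum,
  // so 'MOR', 'Mor' and 'mor' all resolve to the same table type.
  static HoodieTableType fromProperty(String value) {
    switch (value.trim().toLowerCase(Locale.ROOT)) {
      case "cow":
        return HoodieTableType.COPY_ON_WRITE;
      case "mor":
        return HoodieTableType.MERGE_ON_READ;
      default:
        throw new IllegalArgumentException("Unknown table type: " + value);
    }
  }

  public static void main(String[] args) {
    System.out.println(fromProperty("MOR")); // MERGE_ON_READ
    System.out.println(fromProperty("Cow")); // COPY_ON_WRITE
  }
}
{code}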



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7416) Add interface for StreamProfile to be used in StreamSync for reading and writing data.

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7416:

Issue Type: New Feature  (was: Improvement)

> Add interface for StreamProfile to be used in StreamSync for reading and 
> writing data. 
> ---
>
> Key: HUDI-7416
> URL: https://issues.apache.org/jira/browse/HUDI-7416
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: deltastreamer
>Reporter: Vinish Reddy
>Assignee: Vinish Reddy
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> Introducing a new class known as {{StreamProfile}} which contains details 
> about how the next sync round in StreamSync should be consumed and written. 
> For example:
> {{KafkaStreamProfile}} contains the number of events to consume in this sync 
> round.
> {{S3StreamProfile}} contains the list of files to consume in this sync round.
> {{HudiIncrementalStreamProfile}} contains the beginInstant and endInstant 
> commit times to consume in this sync round.
> In the future we can add methods for choosing the writeOperationType and 
> indexType as well; for now {{streamProfile.getSourceSpecificContext()}} will 
> be used to consume the data from the source.
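
A rough sketch of what such an interface could look like, derived only from the description above; the names, generics, and methods here are assumptions, not the actual Hudi API:

{code:java}
// Assumed shape only, based on the issue description above.
public interface StreamProfile<C> {
  // Source-specific context for the next sync round, e.g. the number of Kafka events,
  // the list of S3 files, or the begin/end instants for an incremental Hudi source.
  C getSourceSpecificContext();
}

// Hypothetical example of one source-specific profile mentioned in the description.
class KafkaStreamProfile implements StreamProfile<Long> {
  private final long numEvents;

  KafkaStreamProfile(long numEvents) {
    this.numEvents = numEvents;
  }

  @Override
  public Long getSourceSpecificContext() {
    return numEvents; // number of events to consume in this sync round
  }
}
{code}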



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7416) Add interface for StreamProfile to be used in StreamSync for reading and writing data.

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7416:

Fix Version/s: 0.15.0
   1.0.0

> Add interface for StreamProfile to be used in StreamSync for reading and 
> writing data. 
> ---
>
> Key: HUDI-7416
> URL: https://issues.apache.org/jira/browse/HUDI-7416
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: deltastreamer
>Reporter: Vinish Reddy
>Assignee: Vinish Reddy
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> Introducing a new class known as {{StreamProfile}} which contains details 
> about how the next sync round in StreamSync should be consumed and written. 
> For example:
> {{KafkaStreamProfile}} contains the number of events to consume in this sync 
> round.
> {{S3StreamProfile}} contains the list of files to consume in this sync round.
> {{HudiIncrementalStreamProfile}} contains the beginInstant and endInstant 
> commit times to consume in this sync round.
> In the future we can add methods for choosing the writeOperationType and 
> indexType as well; for now {{streamProfile.getSourceSpecificContext()}} will 
> be used to consume the data from the source.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7423) Support table type name incase-sensitive when create table in sparksql

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo closed HUDI-7423.
---
Resolution: Fixed

> Support table type name incase-sensitive when create table in sparksql
> --
>
> Key: HUDI-7423
> URL: https://issues.apache.org/jira/browse/HUDI-7423
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql
>Reporter: xy
>Assignee: xy
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> Support case-insensitive table type names when creating a table in Spark SQL.
> Many users creating a table set table.type = MOR/Mor/Cow/COW according 
> to the Hudi documentation, like:
>  
> CREATE TABLE `hudi_test`.`hudi_test29` (
>   `app_id` STRING COMMENT 'application id',
>   `message_id` STRING COMMENT 'message id',
>   `dt` INT,
>   `from` STRING,
>   `day` STRING COMMENT 'date partition',
>   `hour` INT COMMENT 'hour partition'
> )using hudi
> tblproperties (
>   'primaryKey' = 'app_id,message_id',
>   'type' = 'MOR',
>   'preCombineField'='dt',
>   'hoodie.index.type' = 'BUCKET',
>   'hoodie.bucket.index.hash.field' = 'app_id,message_id',
>   'hoodie.bucket.index.num.buckets'=256,
>   'hoodie.datasource.hive_sync.table.strategy'='RT'
> )
> PARTITIONED BY (`day`,`hour`);
>  
> it would fail with an error like: 
> java.lang.IllegalArgumentException: No enum constant 
> org.apache.hudi.common.model.HoodieTableType.MOR
>     at java.lang.Enum.valueOf(Enum.java:238)
>     at 
> org.apache.hudi.common.model.HoodieTableType.valueOf(HoodieTableType.java:30)
>     at 
> org.apache.hudi.common.table.HoodieTableMetaClient$PropertyBuilder.setTableType(HoodieTableMetaClient.java:833)
>     at 
> org.apache.hudi.common.table.HoodieTableMetaClient$PropertyBuilder.fromProperties(HoodieTableMetaClient.java:1009)
>     at 
> org.apache.spark.sql.catalyst.catalog.HoodieCatalogTable.initHoodieTable(HoodieCatalogTable.scala:219)
>     at 
> org.apache.spark.sql.hudi.command.CreateHoodieTableCommand.run(CreateHoodieTableCommand.scala:70)
>     at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:80)
>     at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:78)
>     at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:89)
>     at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
>     at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
> It is not user-friendly to require users to set type=mor/cow exactly, so it is 
> better to make the config case-insensitive for users



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7418) Implement file extension filter for s3 incr source

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7418:

Fix Version/s: 0.15.0
   1.0.0

> Implement file extension filter for s3 incr source
> --
>
> Key: HUDI-7418
> URL: https://issues.apache.org/jira/browse/HUDI-7418
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Rajesh Mahindra
>Assignee: Rajesh Mahindra
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> We have support for filtering the input files based on a configurable (custom) 
> extension for the GCS Incr Source. But we don't have the same for the S3 
> incr source (which always assumes that the file extension is the same as the 
> format, which may not always be the case).
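
A small sketch of the kind of extension-based filtering being asked for (illustration only; the helper name is hypothetical, and the actual config key and wiring into the S3 incremental source are not specified in the issue):

{code:java}
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ExtensionFilterSketch {
  // Keep only the object keys that end with the configured extension.
  static List<String> filterByExtension(List<String> filePaths, String extension) {
    return filePaths.stream()
        .filter(path -> path.endsWith(extension))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> listed = Arrays.asList(
        "s3://bucket/data/a.parquet",
        "s3://bucket/data/b.json",
        "s3://bucket/data/c.parquet");
    // Only the .parquet objects would be handed to the incremental source.
    System.out.println(filterByExtension(listed, ".parquet"));
  }
}
{code}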



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [I] [SUPPORT]hudi way of doing bucket index cannot be used to improve query engines queries such join and filter [hudi]

2024-06-06 Thread via GitHub


KnightChess commented on issue #11204:
URL: https://github.com/apache/hudi/issues/11204#issuecomment-2153754087

   > @KnightChess do you have interest to push forward this feature?
   
   @danny0405 yes, I will follow up on this problem


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Remove calcite dependency [hudi]

2024-06-06 Thread via GitHub


hudi-bot commented on PR #11411:
URL: https://github.com/apache/hudi/pull/11411#issuecomment-2153746494

   
   ## CI report:
   
   * fc24469b4e9b8d14355f1439a9a3762feb137490 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24259)
 
   * 0035194f60ef355eeaa21e237fb8bd68579f48cd Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24263)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-06-06 Thread via GitHub


hudi-bot commented on PR #10957:
URL: https://github.com/apache/hudi/pull/10957#issuecomment-2153745981

   
   ## CI report:
   
   * c98242b22fb2518c0cc93c037df558037030500f UNKNOWN
   * d14bbeef6bafabe49bd1375cf0eef82c1c27c597 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24257)
 
   * 1a1ca64bec2fb94acce596934dd636b77cb0aca7 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24261)
 
   * 04f69916d0bbc1bc8877a9fec5b38c40da1f7d46 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24262)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7840) Add position merging back to file group reader

2024-06-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7840:
-
Labels: pull-request-available  (was: )

> Add position merging back to file group reader
> --
>
> Key: HUDI-7840
> URL: https://issues.apache.org/jira/browse/HUDI-7840
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
>
> Was removed to make change to fg reader but will now be added back with 
> proper fix



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[PR] [HUDI-7840] Add position merging to fg reader [hudi]

2024-06-06 Thread via GitHub


jonvex opened a new pull request, #11413:
URL: https://github.com/apache/hudi/pull/11413

   ### Change Logs
   
   Position merging was removed in [HUDI-7567](https://github.com/apache/hudi/pull/10957) but will now be added back with additional testing and proper use of filters, as well as positional merge for bootstrap.
   
   ### Impact
   
   positional merge feature added back to fg reader
   
   ### Risk level (write none, low medium or high below)
   
   medium
   wrote tests
   
   ### Documentation Update
   
   N/A
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-7840) Add position merging back to file group reader

2024-06-06 Thread Jonathan Vexler (Jira)
Jonathan Vexler created HUDI-7840:
-

 Summary: Add position merging back to file group reader
 Key: HUDI-7840
 URL: https://issues.apache.org/jira/browse/HUDI-7840
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Jonathan Vexler
Assignee: Jonathan Vexler


Was removed to make change to fg reader but will now be added back with proper 
fix



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [MINOR] Remove calcite dependency [hudi]

2024-06-06 Thread via GitHub


hudi-bot commented on PR #11411:
URL: https://github.com/apache/hudi/pull/11411#issuecomment-2153741527

   
   ## CI report:
   
   * fc24469b4e9b8d14355f1439a9a3762feb137490 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24259)
 
   * 0035194f60ef355eeaa21e237fb8bd68579f48cd UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-06-06 Thread via GitHub


hudi-bot commented on PR #10957:
URL: https://github.com/apache/hudi/pull/10957#issuecomment-2153740991

   
   ## CI report:
   
   * c98242b22fb2518c0cc93c037df558037030500f UNKNOWN
   * d14bbeef6bafabe49bd1375cf0eef82c1c27c597 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24257)
 
   * 1a1ca64bec2fb94acce596934dd636b77cb0aca7 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24261)
 
   * 04f69916d0bbc1bc8877a9fec5b38c40da1f7d46 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7415) Support OLAP engine query from origin table avoid empty result in default

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7415:

Fix Version/s: 0.15.0
   1.0.0

> Support OLAP engine query from origin table avoid empty result in default
> -
>
> Key: HUDI-7415
> URL: https://issues.apache.org/jira/browse/HUDI-7415
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: xy
>Assignee: xy
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> OLAP queries need to support reading data from the origin table by default. For 
> example, when querying from an OLAP engine such as StarRocks or Presto, we can only 
> read data from the ro/rt sub tables and get an empty result from the origin table, which is not suitable:
> query mor table with starrocks as:
> MySQL [(none)]> select * from hudi_catalog_01.hudi_test.test_mor_hudi_22_rt;
> |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name|
> |20230522100703567|20230522100703567_0_0|1|partition=de|f14492ed-b672-4f60-8e86-6359790feb2a-0_0-17-2013_2023052210|
> 1 row in set (2.11 sec)
> MySQL [(none)]> select * from hudi_catalog_01.hudi_test.test_mor_hudi_22_ro;
> |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name|
> |20230522100703567|20230522100703567_0_0|1|partition=de|f14492ed-b672-4f60-8e86-6359790feb2a-0_0-17-2013_2023052210|
> 1 row in set (0.22 sec)
> MySQL [(none)]> select * from hudi_catalog_01.hudi_test.test_mor_hudi_22;
> Empty set (1.23 sec)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7415) Support OLAP engine query from origin table avoid empty result in default

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo closed HUDI-7415.
---
Resolution: Fixed

> Support OLAP engine query from origin table avoid empty result in default
> -
>
> Key: HUDI-7415
> URL: https://issues.apache.org/jira/browse/HUDI-7415
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: xy
>Assignee: xy
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> OLAP queries need to support reading data from the origin table by default. For 
> example, when querying from an OLAP engine such as StarRocks or Presto, we can only 
> read data from the ro/rt sub tables and get an empty result from the origin table, which is not suitable:
> query mor table with starrocks as:
> MySQL [(none)]> select * from hudi_catalog_01.hudi_test.test_mor_hudi_22_rt;
> |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name|
> |20230522100703567|20230522100703567_0_0|1|partition=de|f14492ed-b672-4f60-8e86-6359790feb2a-0_0-17-2013_2023052210|
> 1 row in set (2.11 sec)
> MySQL [(none)]> select * from hudi_catalog_01.hudi_test.test_mor_hudi_22_ro;
> |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name|
> |20230522100703567|20230522100703567_0_0|1|partition=de|f14492ed-b672-4f60-8e86-6359790feb2a-0_0-17-2013_2023052210|
> 1 row in set (0.22 sec)
> MySQL [(none)]> select * from hudi_catalog_01.hudi_test.test_mor_hudi_22;
> Empty set (1.23 sec)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7406) Rename classes to be readable in storage abstraction

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7406:

Fix Version/s: 0.15.0

> Rename classes to be readable in storage abstraction
> 
>
> Key: HUDI-7406
> URL: https://issues.apache.org/jira/browse/HUDI-7406
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7410) Use SeekableDataInputStream as the input of native HFile reader

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7410:

Fix Version/s: 0.15.0

> Use SeekableDataInputStream as the input of native HFile reader
> ---
>
> Key: HUDI-7410
> URL: https://issues.apache.org/jira/browse/HUDI-7410
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7417) NPE in StreamSync if empty batch and schema and upsert operation new table

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7417:

Fix Version/s: 0.15.0
   1.0.0

> NPE in StreamSync if empty batch and schema and upsert operation new table
> --
>
> Key: HUDI-7417
> URL: https://issues.apache.org/jira/browse/HUDI-7417
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
> Fix For: 0.15.0, 1.0.0
>
>
> Null pointer exception due to an empty schema if it is the first commit and the 
> operation is upsert.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7362) Athena does not support s3a partition scheme anymore leading to missing data

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7362:

Fix Version/s: 1.0.0

>  Athena does not support s3a partition scheme anymore leading to missing data
> -
>
> Key: HUDI-7362
> URL: https://issues.apache.org/jira/browse/HUDI-7362
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: nicolas paris
>Assignee: nicolas paris
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> see https://github.com/apache/hudi/issues/10595



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7104) Cleaner could miss to clean up some files w/ savepoint interplay

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7104:

Issue Type: Bug  (was: Improvement)

> Cleaner could miss to clean up some files w/ savepoint interplay 
> -
>
> Key: HUDI-7104
> URL: https://issues.apache.org/jira/browse/HUDI-7104
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: cleaning, savepoint, table-service
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> Let's say partitioning is day based and is based on created date. So, older 
> partitions generally do not get any new data after a few days. 
>  
> Let's say we have savepoints added to a day and later removed. 
> day 1: cleaned up. 
> day 2: savepoint added, and so ignored by the cleaner. 
> day 3: cleaned up. 
> day 4: earliest commit to retain based on cleaner configs. 
>  
> So, with this table/timeline state, if we remove the savepointed commit, data 
> pertaining to day 2 will never be cleaned by the cleaner since it is earlier than 
> the earliest commit to retain. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7357) Introduce generic StorageConfiguration

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7357:

Fix Version/s: 0.15.0

> Introduce generic StorageConfiguration
> --
>
> Key: HUDI-7357
> URL: https://issues.apache.org/jira/browse/HUDI-7357
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7104) Cleaner could miss to clean up some files w/ savepoint interplay

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7104:

Fix Version/s: 0.15.0
   1.0.0

> Cleaner could miss to clean up some files w/ savepoint interplay 
> -
>
> Key: HUDI-7104
> URL: https://issues.apache.org/jira/browse/HUDI-7104
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: cleaning, savepoint, table-service
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> Let's say partitioning is day based and is based on created date. So, older 
> partitions generally do not get any new data after a few days. 
>  
> Let's say we have savepoints added to a day and later removed. 
> day 1: cleaned up. 
> day 2: savepoint added, and so ignored by the cleaner. 
> day 3: cleaned up. 
> day 4: earliest commit to retain based on cleaner configs. 
>  
> So, with this table/timeline state, if we remove the savepointed commit, data 
> pertaining to day 2 will never be cleaned by the cleaner since it is earlier than 
> the earliest commit to retain. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7104) Cleaner could miss to clean up some files w/ savepoint interplay

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo closed HUDI-7104.
---
Resolution: Fixed

> Cleaner could miss to clean up some files w/ savepoint interplay 
> -
>
> Key: HUDI-7104
> URL: https://issues.apache.org/jira/browse/HUDI-7104
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: cleaning, savepoint, table-service
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> Let's say partitioning is day based and is based on created date. So, older 
> partitions generally do not get any new data after a few days. 
>  
> Let's say we have savepoints added to a day and later removed. 
> day 1: cleaned up. 
> day 2: savepoint added, and so ignored by the cleaner. 
> day 3: cleaned up. 
> day 4: earliest commit to retain based on cleaner configs. 
>  
> So, with this table/timeline state, if we remove the savepointed commit, data 
> pertaining to day 2 will never be cleaned by the cleaner since it is earlier than 
> the earliest commit to retain. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7357) Introduce generic StorageConfiguration

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7357:

Issue Type: New Feature  (was: Improvement)

> Introduce generic StorageConfiguration
> --
>
> Key: HUDI-7357
> URL: https://issues.apache.org/jira/browse/HUDI-7357
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7397) Add support to purge a clustering instant

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo closed HUDI-7397.
---
Resolution: Fixed

> Add support to purge a clustering instant
> -
>
> Key: HUDI-7397
> URL: https://issues.apache.org/jira/browse/HUDI-7397
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: clustering, table-service
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> As of now, if a user has made a mistake in the clustering params and wishes to 
> completely purge a pending clustering, we do not have any support for that. It 
> would be good to add that support.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7364) Move InLineFs classes to hudi-hadoop-common module

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7364:

Fix Version/s: 0.15.0

> Move InLineFs classes to hudi-hadoop-common module
> --
>
> Key: HUDI-7364
> URL: https://issues.apache.org/jira/browse/HUDI-7364
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7397) Add support to purge a clustering instant

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7397:

Fix Version/s: 0.15.0
   1.0.0

> Add support to purge a clustering instant
> -
>
> Key: HUDI-7397
> URL: https://issues.apache.org/jira/browse/HUDI-7397
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: clustering, table-service
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> As of now, if a user has made a mistake in the clustering params and wishes to 
> completely purge a pending clustering, we do not have any support for that. It 
> would be good to add that support.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7373) Revert config name back to hoodie.write.set.null.for.missing.columns in master

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7373:

Fix Version/s: 0.15.0
   1.0.0

> Revert config name back to hoodie.write.set.null.for.missing.columns in master
> --
>
> Key: HUDI-7373
> URL: https://issues.apache.org/jira/browse/HUDI-7373
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> hoodie.write.set.null.for.missing.columns was renamed to 
> hoodie.write.handle.missing.cols.with.lossless.type.promotion. The config 
> only adds the missing columns. The "reverse type promotion" is not controlled 
> by a config.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7394) Fix run script of hudi-timeline-server-bundle

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7394:

Fix Version/s: 1.0.0

> Fix run script of hudi-timeline-server-bundle
> -
>
> Key: HUDI-7394
> URL: https://issues.apache.org/jira/browse/HUDI-7394
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: voon
>Assignee: voon
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> h1. Issue Description
> Hudi's timeline server bundle relies on the Avro version from hadoop-hdfs's 
> common lib jars.
>  
> hadoop-hdfs might ship with an older Avro version (depending on the 
> hadoop-hdfs version). 
>  
> For example:
> {code:java}
> Running command : java -cp 
> ...:/usr/share/hadoop-client/share/hadoop/common/lib/avro-1.7.7.jar:... {code}
>  
> Class conflict occurs with the following error:
> {code:java}
> java.lang.NoClassDefFoundError: Lorg/apache/avro/message/BinaryMessageEncoder;
>     at java.lang.Class.getDeclaredFields0(Native Method)
>     at java.lang.Class.privateGetDeclaredFields(Class.java:2583)
>     at java.lang.Class.getDeclaredField(Class.java:2068)
>     at 
> org.apache.avro.specific.SpecificData.createSchema(SpecificData.java:240)
>     at org.apache.avro.specific.SpecificData.getSchema(SpecificData.java:189)
>     at 
> org.apache.avro.specific.SpecificDatumReader.(SpecificDatumReader.java:34)
>     at 
> org.apache.hudi.common.table.timeline.TimelineMetadataUtils.deserializeAvroMetadata(TimelineMetadataUtils.java:203)
>     at 
> org.apache.hudi.common.table.timeline.TimelineMetadataUtils.deserializeCompactionPlan(TimelineMetadataUtils.java:166)
>     at 
> org.apache.hudi.common.util.CompactionUtils.getCompactionPlan(CompactionUtils.java:149)
>     at 
> org.apache.hudi.common.util.CompactionUtils.lambda$getAllPendingCompactionPlans$2(CompactionUtils.java:139)
>     at 
> java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
>     at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>     at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
>     at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
>     at 
> java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
>     at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>     at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
>     at 
> org.apache.hudi.common.util.CompactionUtils.getAllPendingCompactionPlans(CompactionUtils.java:143)
>     at 
> org.apache.hudi.common.util.CompactionUtils.getAllPendingCompactionOperations(CompactionUtils.java:163)
>     at 
> org.apache.hudi.common.table.view.AbstractTableFileSystemView.init(AbstractTableFileSystemView.java:113)
>     at 
> org.apache.hudi.common.table.view.HoodieTableFileSystemView.init(HoodieTableFileSystemView.java:108)
>     at 
> org.apache.hudi.common.table.view.SpillableMapBasedFileSystemView.(SpillableMapBasedFileSystemView.java:72)
>     at 
> org.apache.hudi.common.table.view.FileSystemViewManager.createSpillableMapBasedFileSystemView(FileSystemViewManager.java:156)
>     at 
> org.apache.hudi.common.table.view.FileSystemViewManager.lambda$createViewManager$31512a51$1(FileSystemViewManager.java:255)
>     at 
> org.apache.hudi.common.table.view.FileSystemViewManager.lambda$getFileSystemView$0(FileSystemViewManager.java:103)
>     at 
> java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1660)
>     at 
> org.apache.hudi.common.table.view.FileSystemViewManager.getFileSystemView(FileSystemViewManager.java:101)
>     at 
> org.apache.hudi.timeline.service.handlers.TimelineHandler.getTimeline(TimelineHandler.java:50)
>     at 
> org.apache.hudi.timeline.service.RequestHandler.lambda$registerTimelineAPI$2(RequestHandler.java:237)
>     at 
> org.apache.hudi.timeline.service.RequestHandler$ViewHandler.handle(RequestHandler.java:518)
>     at io.javalin.security.SecurityUtil.noopAccessManager(SecurityUtil.kt:22)
>     at io.javalin.Javalin.lambda$addHandler$0(Javalin.java:606)
>     at io.javalin.core.JavalinServlet$service$2$1.invoke(JavalinServlet.kt:46)
>     at io.javalin.core.JavalinServlet$service$2$1.invoke(JavalinServlet.kt:17)
>     at io.javalin.core.JavalinServlet$service$1.invoke(JavalinServlet.kt:143)
>     at io.javalin.core.JavalinServlet$service$2.invoke(JavalinServlet.kt:41)
>     at io.javalin.core.JavalinServlet.service(JavalinServlet.kt:107)
>     at 
> io.javalin.core.util.JettyServerUtil$initialize$httpHandler$1.doHandle(JettyServerUtil.kt:72)
>     at 
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
>     at 
> 

[jira] [Closed] (HUDI-7373) Revert config name back to hoodie.write.set.null.for.missing.columns in master

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo closed HUDI-7373.
---
Resolution: Fixed

> Revert config name back to hoodie.write.set.null.for.missing.columns in master
> --
>
> Key: HUDI-7373
> URL: https://issues.apache.org/jira/browse/HUDI-7373
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> hoodie.write.set.null.for.missing.columns was renamed to 
> hoodie.write.handle.missing.cols.with.lossless.type.promotion. The config 
> only adds the missing columns. The "reverse type promotion" is not controlled 
> by a config.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7392) Fix connection leak causing lingering CLOSE_WAIT TCP connections

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7392:

Fix Version/s: 1.0.0

> Fix connection leak causing lingering CLOSE_WAIT TCP connections
> 
>
> Key: HUDI-7392
> URL: https://issues.apache.org/jira/browse/HUDI-7392
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: voon
>Assignee: voon
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> When consistent_hashing is enabled and a long running Spark job 
> (Deltastreamer) is created, we noticed that there is a gradual increase in 
> CLOSE_WAIT connections originating from the AM -> HDFS DN. 
>  
> Command to check for close waits
> {code:java}
> netstat -anlpt | grep CLOSE_WAIT | grep 50010{code}
> Result
> {code:java}
> tcp6       1      0 10.1.2.3:45994      10.5.4.3:50010      CLOSE_WAIT  
> 2446/java
> tcp6       1      0 10.1.2.3:48478      10.6.5.4:50010      CLOSE_WAIT  
> 2446/java
> tcp6       1      0 10.1.2.3:49542      10.7.6.5:50010      CLOSE_WAIT  
> 2446/java
> tcp6       1      0 10.1.2.3:47220      10.8.7.6:50010      CLOSE_WAIT  
> 2446/java
> tcp6       1      0 10.1.2.3:49786      10.9.8.7:50010      CLOSE_WAIT  
> 2446/java {code}
>  
> Socket analysis using ss (last send) pointed us in the direction that these 
> CLOSE_WAITs were only created between the INFLIGHT and COMPLETED instants 
> (inclusive), at this point in time. On top of that, this issue is only 
> reproducible in tables using the consistent hashing index.
>  
> To reproduce this:
>  
> {code:java}
> CREATE TABLE dev_hudi.close_wait_issue_investigation (
>     id INT,
>     name STRING,
>     date_col STRING,
>     grass_region STRING
> ) USING hudi
> PARTITIONED BY (grass_region)
> tblproperties (
>     primaryKey = 'id',
>     type = 'mor',
>     precombineField = 'id',
>     hoodie.index.type = 'BUCKET',
>     hoodie.index.bucket.engine = 'CONSISTENT_HASHING',     
> hoodie.compact.inline = 'true'
> )
> LOCATION 'hdfs://DEV/close_wait_issue_investigation';
> INSERT INTO dev_hudi.close_wait_issue_investigation VALUES (1, 'alex1', 
> '2023-12-22', 'SG');
> INSERT INTO dev_hudi.close_wait_issue_investigation VALUES (2, 'alex2', 
> '2023-12-22', 'SG');
> INSERT INTO dev_hudi.close_wait_issue_investigation VALUES (3, 'alex3', 
> '2023-12-22', 'SG');
> INSERT INTO dev_hudi.close_wait_issue_investigation VALUES (4, 'alex4', 
> '2023-12-22', 'SG');
> INSERT INTO dev_hudi.close_wait_issue_investigation VALUES (5, 'alex5', 
> '2023-12-22', 'SG');{code}
>  
>  Observation:
> After every INSERT, there will be 1 new CLOSE_WAIT.
>  
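For context, lingering CLOSE_WAIT sockets toward DataNodes typically come from HDFS input streams that are opened but never closed. The snippet below is only a generic illustration of the pattern whose absence produces such leaks, not the actual fix in this ticket; the class and method names are made up for the example.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CloseWaitIllustration {
  // Reading without closing the stream leaves the DataNode socket stuck in
  // CLOSE_WAIT once the remote side shuts down; try-with-resources releases it.
  static byte[] readHead(Configuration conf, String file, int n) throws Exception {
    try (FileSystem fs = FileSystem.newInstance(conf);
         FSDataInputStream in = fs.open(new Path(file))) {
      byte[] buf = new byte[n];
      in.readFully(0, buf);
      return buf;
    }
  }
}
{code}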



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7387) Serializable Class need contains serialVersionUID to keep compatibility in upgrade

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7387:

Fix Version/s: 0.15.0

> Serializable Class need contains serialVersionUID to keep compatibility in 
> upgrade
> --
>
> Key: HUDI-7387
> URL: https://issues.apache.org/jira/browse/HUDI-7387
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: xy
>Assignee: xy
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> Serializable classes need to declare a serialVersionUID to keep compatibility 
> and avoid errors about a missing serial version ID during upgrades.
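As a reminder of the convention the ticket asks for, a serializable class pins its serialized form as shown below; the class and fields are illustrative, not a specific Hudi class.

{code:java}
import java.io.Serializable;

public class ExampleWriteStatus implements Serializable {
  // Pinning serialVersionUID keeps old and new builds compatible when serialized
  // instances cross an upgrade boundary (e.g. executors running mixed jar versions).
  private static final long serialVersionUID = 1L;

  private String fileId;
  private long numWrites;
}
{code}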



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7367) Add makeQualified to HoodieLocation

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7367:

Fix Version/s: 0.15.0

> Add makeQualified to HoodieLocation
> ---
>
> Key: HUDI-7367
> URL: https://issues.apache.org/jira/browse/HUDI-7367
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7338) Bump HBase, pulsar-client, and jetty version

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7338:

Fix Version/s: 0.15.0

> Bump HBase, pulsar-client, and jetty version
> 
>
> Key: HUDI-7338
> URL: https://issues.apache.org/jira/browse/HUDI-7338
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Shawn Chang
>Assignee: Shawn Chang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> There is a major CVE spotted in jetty/netty: 
> [https://nvd.nist.gov/vuln/detail/CVE-2023-44487]
>  
> Bumping the version can help mitigate the problem



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [MINOR] Add calcite dependency explicitly [hudi]

2024-06-06 Thread via GitHub


CTTY commented on code in PR #11411:
URL: https://github.com/apache/hudi/pull/11411#discussion_r1630525725


##
hudi-spark-datasource/hudi-spark/pom.xml:
##
@@ -188,6 +188,12 @@
   ${project.version}
 
 
+
+  org.apache.calcite
+  calcite-core

Review Comment:
   This makes more sense; I'm modifying this PR to switch to 
`scala.math.abs`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Fix missing serDe properties post migration from hiveSync to glueSync [hudi]

2024-06-06 Thread via GitHub


danny0405 commented on PR #11404:
URL: https://github.com/apache/hudi/pull/11404#issuecomment-2153717046

   @prathit06 Can you check the test failures?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7146] Integrate secondary index on reader path [hudi]

2024-06-06 Thread via GitHub


danny0405 commented on code in PR #11162:
URL: https://github.com/apache/hudi/pull/11162#discussion_r1630522596


##
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java:
##
@@ -851,6 +862,167 @@ private Map 
reverseLookupSecondaryKeys(String partitionName, Lis
 return recordKeyMap;
   }
 
+  @Override
+  protected Map>> 
getSecondaryIndexRecords(List keys, String partitionName) {
+if (keys.isEmpty()) {
+  return Collections.emptyMap();
+}
+
+// Load the file slices for the partition. Each file slice is a shard 
which saves a portion of the keys.
+List partitionFileSlices = 
partitionFileSliceMap.computeIfAbsent(partitionName,
+k -> 
HoodieTableMetadataUtil.getPartitionLatestMergedFileSlices(metadataMetaClient, 
metadataFileSystemView, partitionName));
+final int numFileSlices = partitionFileSlices.size();
+ValidationUtils.checkState(numFileSlices > 0, "Number of file slices for 
partition " + partitionName + " should be > 0");
+
+engineContext.setJobStatus(this.getClass().getSimpleName(), "Lookup keys 
from each file slice");
+HoodieData partitionRDD = 
engineContext.parallelize(partitionFileSlices);
+// Define the seqOp function (merges elements within a partition)
+Functions.Function2>>, FileSlice, Map>>> seqOp =
+(accumulator, partition) -> {
+  Map>> 
currentFileSliceResult = lookupSecondaryKeysFromFileSlice(partitionName, keys, 
partition);
+  currentFileSliceResult.forEach((secondaryKey, secondaryRecords) -> 
accumulator.merge(secondaryKey, secondaryRecords, (oldRecords, newRecords) -> {
+newRecords.addAll(oldRecords);
+return newRecords;
+  }));
+  return accumulator;
+};
+// Define the combOp function (merges elements across partitions)
+Functions.Function2>>, Map>>, Map>>> combOp =
+(map1, map2) -> {
+  map2.forEach((secondaryKey, secondaryRecords) -> 
map1.merge(secondaryKey, secondaryRecords, (oldRecords, newRecords) -> {
+newRecords.addAll(oldRecords);
+return newRecords;
+  }));
+  return map1;
+};
+// Use aggregate to merge results within and across partitions
+// Define the zero value (initial value)
+Map>> zeroValue = new 
HashMap<>();
+return engineContext.aggregate(partitionRDD, zeroValue, seqOp, combOp);
+  }
+
+  /**
+   * Lookup list of keys from a single file slice.
+   *
+   * @param partitionName Name of the partition
+   * @param secondaryKeys The list of secondary keys to lookup
+   * @param fileSlice The file slice to read
+   * @return A {@code Map} of secondary-key to list of {@code HoodieRecord} 
for the secondary-keys which were found in the file slice
+   */
+  private Map>> 
lookupSecondaryKeysFromFileSlice(String partitionName, List 
secondaryKeys, FileSlice fileSlice) {
+Map> logRecordsMap = new HashMap<>();
+
+Pair, HoodieMetadataLogRecordReader> readers = 
getOrCreateReaders(partitionName, fileSlice);
+try {
+  List timings = new ArrayList<>(1);
+  HoodieSeekingFileReader baseFileReader = readers.getKey();
+  HoodieMetadataLogRecordReader logRecordScanner = readers.getRight();
+  if (baseFileReader == null && logRecordScanner == null) {
+return Collections.emptyMap();
+  }
+
+  // Sort it here once so that we don't need to sort individually for base 
file and for each individual log files.
+  Set secondaryKeySet = new HashSet<>(secondaryKeys.size());
+  List sortedSecondaryKeys = new ArrayList<>(secondaryKeys);
+  Collections.sort(sortedSecondaryKeys);
+  secondaryKeySet.addAll(sortedSecondaryKeys);
+
+  logRecordScanner.getRecords().forEach(record -> {
+HoodieMetadataPayload payload = record.getData();
+String recordKey = payload.getRecordKeyFromSecondaryIndex();
+if (secondaryKeySet.contains(recordKey)) {
+  String secondaryKey = payload.getRecordKeyFromSecondaryIndex();
+  logRecordsMap.computeIfAbsent(secondaryKey, k -> new 
HashMap<>()).put(recordKey, record);
+}
+  });
+
+  return readNonUniqueRecordsAndMergeWithLogRecords(baseFileReader, 
sortedSecondaryKeys, logRecordsMap, timings, partitionName);
+} catch (IOException ioe) {
+  throw new HoodieIOException("Error merging records from metadata table 
for  " + secondaryKeys.size() + " key : ", ioe);
+} finally {
+  if (!reuse) {
+closeReader(readers);
+  }
+}
+  }
+
+  private Map>> 
readNonUniqueRecordsAndMergeWithLogRecords(HoodieSeekingFileReader reader,
+   
 List sortedKeys,
+   
 Map> 
logRecordsMap,
+   

Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-06-06 Thread via GitHub


hudi-bot commented on PR #10957:
URL: https://github.com/apache/hudi/pull/10957#issuecomment-2153699361

   
   ## CI report:
   
   * c98242b22fb2518c0cc93c037df558037030500f UNKNOWN
   * d14bbeef6bafabe49bd1375cf0eef82c1c27c597 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24257)
 
   * 1a1ca64bec2fb94acce596934dd636b77cb0aca7 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24261)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-06-06 Thread via GitHub


hudi-bot commented on PR #10957:
URL: https://github.com/apache/hudi/pull/10957#issuecomment-2153693631

   
   ## CI report:
   
   * c98242b22fb2518c0cc93c037df558037030500f UNKNOWN
   * d14bbeef6bafabe49bd1375cf0eef82c1c27c597 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24257)
 
   * 1a1ca64bec2fb94acce596934dd636b77cb0aca7 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-06-06 Thread via GitHub


hudi-bot commented on PR #10957:
URL: https://github.com/apache/hudi/pull/10957#issuecomment-2153687538

   
   ## CI report:
   
   * c98242b22fb2518c0cc93c037df558037030500f UNKNOWN
   * d14bbeef6bafabe49bd1375cf0eef82c1c27c597 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24257)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7366) Fix HoodieLocation with encoded paths

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7366:

Fix Version/s: 0.15.0

> Fix HoodieLocation with encoded paths
> -
>
> Key: HUDI-7366
> URL: https://issues.apache.org/jira/browse/HUDI-7366
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> Encoded path like "s3://foo/bar/1%2F2%2F3" should be kept as is.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7375) Fix flaky test: testLogReaderWithDifferentVersionsOfDeleteBlocks

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7375:

Fix Version/s: (was: 0.15.0)

> Fix flaky test: testLogReaderWithDifferentVersionsOfDeleteBlocks
> 
>
> Key: HUDI-7375
> URL: https://issues.apache.org/jira/browse/HUDI-7375
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Lin Liu
>Assignee: Lin Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> {code:java}
> Error:  testLogReaderWithDifferentVersionsOfDeleteBlocks{DiskMapType, 
> boolean, boolean, boolean}[13]  Time elapsed: 0.043 s  <<< ERROR!
> 3421org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
> /user/root/[13] BITCASK, false, true, 
> false1706913234251/partition_path/.test-fileid1_100.log.1_1-0-1 could only be 
> written to 0 of the 1 minReplication nodes. There are 3 datanode(s) running 
> and 3 node(s) are excluded in this operation.
> 3422  at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2338)
> 3423  at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
> 3424  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2989)
> 3425  at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:911)
> 3426  at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:595)
> 3427  at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> 3428  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:621)
> 3429  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589)
> 3430  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
> 3431  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1213)
> 3432  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1089)
> 3433  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1012)
> 3434  at java.security.AccessController.doPrivileged(Native Method)
> 3435  at javax.security.auth.Subject.doAs(Subject.java:422)
> 3436  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
> 3437  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3026)
> 3438
> 3439  at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1612)
> 3440  at org.apache.hadoop.ipc.Client.call(Client.java:1558)
> 3441  at org.apache.hadoop.ipc.Client.call(Client.java:1455)
> 3442  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:242)
> 3443  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:129)
> 3444  at jdk.proxy2/jdk.proxy2.$Proxy43.addBlock(Unknown Source)
> 3445  at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:530)
> 3446  at jdk.internal.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
> 3447  at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 3448  at java.base/java.lang.reflect.Method.invoke(Method.java:568)
> 3449  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> 3450  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> 3451  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> 3452  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> 3453  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> 3454  at jdk.proxy2/jdk.proxy2.$Proxy44.addBlock(Unknown Source)
> 3455  at 
> org.apache.hadoop.hdfs.DFSOutputStream.addBlock(DFSOutputStream.java:1088)
> 3456  at 
> org.apache.hadoop.hdfs.DataStreamer.locateFollowingBlock(DataStreamer.java:1915)
> 3457  at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1717)
> 3458  at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:713)
> 3459 {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7375) Fix flaky test: testLogReaderWithDifferentVersionsOfDeleteBlocks

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7375:

Fix Version/s: 0.15.0
   1.0.0

> Fix flaky test: testLogReaderWithDifferentVersionsOfDeleteBlocks
> 
>
> Key: HUDI-7375
> URL: https://issues.apache.org/jira/browse/HUDI-7375
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Lin Liu
>Assignee: Lin Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> {code:java}
> Error:  testLogReaderWithDifferentVersionsOfDeleteBlocks{DiskMapType, 
> boolean, boolean, boolean}[13]  Time elapsed: 0.043 s  <<< ERROR!
> 3421org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
> /user/root/[13] BITCASK, false, true, 
> false1706913234251/partition_path/.test-fileid1_100.log.1_1-0-1 could only be 
> written to 0 of the 1 minReplication nodes. There are 3 datanode(s) running 
> and 3 node(s) are excluded in this operation.
> 3422  at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2338)
> 3423  at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
> 3424  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2989)
> 3425  at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:911)
> 3426  at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:595)
> 3427  at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> 3428  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:621)
> 3429  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589)
> 3430  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
> 3431  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1213)
> 3432  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1089)
> 3433  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1012)
> 3434  at java.security.AccessController.doPrivileged(Native Method)
> 3435  at javax.security.auth.Subject.doAs(Subject.java:422)
> 3436  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
> 3437  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3026)
> 3438
> 3439  at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1612)
> 3440  at org.apache.hadoop.ipc.Client.call(Client.java:1558)
> 3441  at org.apache.hadoop.ipc.Client.call(Client.java:1455)
> 3442  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:242)
> 3443  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:129)
> 3444  at jdk.proxy2/jdk.proxy2.$Proxy43.addBlock(Unknown Source)
> 3445  at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:530)
> 3446  at jdk.internal.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
> 3447  at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 3448  at java.base/java.lang.reflect.Method.invoke(Method.java:568)
> 3449  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> 3450  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> 3451  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> 3452  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> 3453  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> 3454  at jdk.proxy2/jdk.proxy2.$Proxy44.addBlock(Unknown Source)
> 3455  at 
> org.apache.hadoop.hdfs.DFSOutputStream.addBlock(DFSOutputStream.java:1088)
> 3456  at 
> org.apache.hadoop.hdfs.DataStreamer.locateFollowingBlock(DataStreamer.java:1915)
> 3457  at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1717)
> 3458  at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:713)
> 3459 {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6868) Hudi HiveSync doesn't support extracting passwords from credential store

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6868:

Fix Version/s: 0.15.0
   1.0.0

> Hudi HiveSync doesn't support extracting passwords from credential store
> 
>
> Key: HUDI-6868
> URL: https://issues.apache.org/jira/browse/HUDI-6868
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: hive, hudi-utilities, spark
>Reporter: Kuldeep Kulkarni
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
> Attachments: pyspark_hudi_test.py
>
>
> We have a customer use-case of running PySpark on [Dataproc 
> Serverless|https://cloud.google.com/dataproc-serverless/docs/overview] with 
> [hudi-spark3-bundle|https://mvnrepository.com/artifact/org.apache.hudi/hudi-spark3-bundle].
>  The PySpark job fails to sync the Hudi table with the HMS DB (a remote CloudSQL DB 
> instance) because it cannot extract the password from the credential store. 
> The same job works fine if we pass the Hive Metastore DB user password directly 
> instead of the credential store. 
> Checking 
> [code|https://github.com/apache/hudi/blob/release-0.12.3/hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncConfig.java]
>  for HiveSync configs or 
> [HiveSyncConfigHolder|https://github.com/apache/hudi/blob/73c2167566730a76a0650d488511253ebc66156f/hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncConfigHolder.java#L44],
>  I don't see any option where it detects credential store for extracting 
> passwords. Something like [this 
> code|https://github.com/apache/hive/blob/rel/release-2.3.9/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L482]
>  from HMS ObjectStore.
> [Hive Sync Config Document|https://hudi.apache.org/docs/syncing_metastore/] 
> also doesn't have any reference of using credential store. 
> In order to find the password through the Hadoop Credential Provider API, it 
> would need to make a call to 
> [`Configuration#getPassword(String)`|https://hadoop.apache.org/docs/r3.3.6/api/org/apache/hadoop/conf/Configuration.html#getPassword-java.lang.String-].
>  We don't see anywhere in the Hudi codebase calling "getPassword"
>  
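For reference, the Hadoop API the ticket points to resolves a password from the configured credential provider roughly as follows. This is a sketch of the missing call, not existing Hudi code; the property name mirrors the one used in the repro commands below.

{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;

public class CredentialStoreLookupSketch {
  // Configuration#getPassword consults hadoop.security.credential.provider.path
  // first and falls back to the plain config value when no provider is set.
  public static String resolveMetastorePassword(Configuration conf) throws IOException {
    char[] password = conf.getPassword("javax.jdo.option.ConnectionPassword");
    return password == null ? null : new String(password);
  }
}
{code}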
> *Repro steps:*
>  
> Sample PySpark script - Attached. 
>  
> Command with successful job execution with Metastore DB password:
> {code:java}
> gcloud dataproc batches submit --version 1.1 --container-image 
> gcr.io//new-custom-debian:v4 --region  pyspark 
> gs:///pyspark_hudi_test.py 
> --jars="gs:///hudi-spark3-bundle_2.12-0.12.3.jar" --properties 
> "spark.hadoop.javax.jdo.option.ConnectionURL=jdbc:mysql://:3306/hive_metastore,spark.hadoop.javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver,spark.hadoop.javax.jdo.option.ConnectionUserName=hive,spark.hadoop.javax.jdo.option.ConnectionPassword="
>  --deps-bucket gs:// -- SPARK_EXTRA_CLASSPATH=/opt/spark/jars/* 
> {code}
>  
> Failing command ( with credential store):
> {code:java}
> gcloud dataproc batches submit --version 1.1 --container-image 
> gcr.io//new-custom-debian:v4 --region  pyspark 
> gs:///pyspark_hudi_test.py 
> --jars="gs:///hudi-spark3-bundle_2.12-0.12.3.jar" --properties 
> "spark.hadoop.javax.jdo.option.ConnectionURL=jdbc:mysql://:3306/hive_metastore,spark.hadoop.javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver,spark.hadoop.javax.jdo.option.ConnectionUserName=hive,spark.hadoop.hadoop.security.credential.provider.path=jceks://gs@/metastore-pass-v2.jceks"
>  --deps-bucket gs:// -- SPARK_EXTRA_CLASSPATH=/opt/spark/jars/*  
> {code}
>  
> Error:
> {code:java}
> 23/09/11 04:30:42 INFO HoodieSparkSqlWriter$: Commit 20230911042953444 
> successful!
> 23/09/11 04:30:42 INFO HoodieSparkSqlWriter$: Config.inlineCompactionEnabled 
> ? false
> 23/09/11 04:30:42 INFO HoodieSparkSqlWriter$: Compaction Scheduled is 
> Optional.empty
> 23/09/11 04:30:42 INFO HoodieSparkSqlWriter$: Config.asyncClusteringEnabled ? 
> false
> 23/09/11 04:30:42 INFO HoodieSparkSqlWriter$: Clustering Scheduled is 
> Optional.empty
> 23/09/11 04:30:42 INFO HiveConf: Found configuration file null
> [..]
> 23/09/11 04:30:42 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient 
> from gs:///
> 23/09/11 04:30:42 INFO HoodieTableConfig: Loading table properties from 
> gs:///.hoodie/hoodie.properties
> 23/09/11 04:30:42 INFO HoodieTableMetaClient: Finished Loading Table of type 
> COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from gs:///
> 23/09/11 04:30:42 INFO HoodieTableMetaClient: Loading Active commit timeline 
> for gs:///
> 23/09/11 04:30:42 INFO HoodieActiveTimeline: Loaded instants upto : 
> Option\{val=[20230911042953444__commit__COMPLETED]}
> 23/09/11 04:30:43 INFO HiveMetaStore: 0: Opening raw store with 
> implementation 

[jira] [Updated] (HUDI-7344) Use Java [Input/Output]Stream instead of FSData[Input/Output]Stream when possible

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7344:

Fix Version/s: 0.15.0

> Use Java [Input/Output]Stream instead of FSData[Input/Output]Stream when 
> possible
> -
>
> Key: HUDI-7344
> URL: https://issues.apache.org/jira/browse/HUDI-7344
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7218) Integrate new HFile reader with HoodieHFileReader

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7218:

Fix Version/s: 0.15.0

> Integrate new HFile reader with HoodieHFileReader
> -
>
> Key: HUDI-7218
> URL: https://issues.apache.org/jira/browse/HUDI-7218
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7345) Remove usage of org.apache.hadoop.util.VersionUtil

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7345:

Fix Version/s: 0.15.0

> Remove usage of org.apache.hadoop.util.VersionUtil
> --
>
> Key: HUDI-7345
> URL: https://issues.apache.org/jira/browse/HUDI-7345
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7343) Replace Path.SEPARATOR with HoodieLocation.SEPARATOR

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7343:

Fix Version/s: 0.15.0

> Replace Path.SEPARATOR with HoodieLocation.SEPARATOR
> 
>
> Key: HUDI-7343
> URL: https://issues.apache.org/jira/browse/HUDI-7343
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7347) Introduce SeekableDataInputStream for random access

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7347:

Fix Version/s: 0.15.0

> Introduce SeekableDataInputStream for random access
> ---
>
> Key: HUDI-7347
> URL: https://issues.apache.org/jira/browse/HUDI-7347
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-6868) Hudi HiveSync doesn't support extracting passwords from credential store

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo closed HUDI-6868.
---
Resolution: Fixed

> Hudi HiveSync doesn't support extracting passwords from credential store
> 
>
> Key: HUDI-6868
> URL: https://issues.apache.org/jira/browse/HUDI-6868
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: hive, hudi-utilities, spark
>Reporter: Kuldeep Kulkarni
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
> Attachments: pyspark_hudi_test.py
>
>
> We have a customer use-case of running PySpark on [Dataproc 
> Serverless|https://cloud.google.com/dataproc-serverless/docs/overview] with 
> [hudi-spark3-bundle|https://mvnrepository.com/artifact/org.apache.hudi/hudi-spark3-bundle].
>  The PySpark job fails to sync the Hudi table with the HMS DB (a remote CloudSQL DB 
> instance) because it cannot extract the password from the credential store. 
> The same job works fine if we pass the Hive Metastore DB user password directly 
> instead of the credential store. 
> Checking 
> [code|https://github.com/apache/hudi/blob/release-0.12.3/hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncConfig.java]
>  for HiveSync configs or 
> [HiveSyncConfigHolder|https://github.com/apache/hudi/blob/73c2167566730a76a0650d488511253ebc66156f/hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncConfigHolder.java#L44],
>  I don't see any option where it detects credential store for extracting 
> passwords. Something like [this 
> code|https://github.com/apache/hive/blob/rel/release-2.3.9/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L482]
>  from HMS ObjectStore.
> [Hive Sync Config Document|https://hudi.apache.org/docs/syncing_metastore/] 
> also doesn't have any reference of using credential store. 
> In order to find the password through the Hadoop Credential Provider API, it 
> would need to make a call to 
> [`Configuration#getPassword(String)`|https://hadoop.apache.org/docs/r3.3.6/api/org/apache/hadoop/conf/Configuration.html#getPassword-java.lang.String-].
>  We don't see anywhere in the Hudi codebase calling "getPassword"
>  
> *Repro steps:*
>  
> Sample PySpark script - Attached. 
>  
> Command with successful job execution with Metastore DB password:
> {code:java}
> gcloud dataproc batches submit --version 1.1 --container-image 
> gcr.io//new-custom-debian:v4 --region  pyspark 
> gs:///pyspark_hudi_test.py 
> --jars="gs:///hudi-spark3-bundle_2.12-0.12.3.jar" --properties 
> "spark.hadoop.javax.jdo.option.ConnectionURL=jdbc:mysql://:3306/hive_metastore,spark.hadoop.javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver,spark.hadoop.javax.jdo.option.ConnectionUserName=hive,spark.hadoop.javax.jdo.option.ConnectionPassword="
>  --deps-bucket gs:// -- SPARK_EXTRA_CLASSPATH=/opt/spark/jars/* 
> {code}
>  
> Failing command ( with credential store):
> {code:java}
> gcloud dataproc batches submit --version 1.1 --container-image 
> gcr.io//new-custom-debian:v4 --region  pyspark 
> gs:///pyspark_hudi_test.py 
> --jars="gs:///hudi-spark3-bundle_2.12-0.12.3.jar" --properties 
> "spark.hadoop.javax.jdo.option.ConnectionURL=jdbc:mysql://:3306/hive_metastore,spark.hadoop.javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver,spark.hadoop.javax.jdo.option.ConnectionUserName=hive,spark.hadoop.hadoop.security.credential.provider.path=jceks://gs@/metastore-pass-v2.jceks"
>  --deps-bucket gs:// -- SPARK_EXTRA_CLASSPATH=/opt/spark/jars/*  
> {code}
>  
> Error:
> {code:java}
> 23/09/11 04:30:42 INFO HoodieSparkSqlWriter$: Commit 20230911042953444 
> successful!
> 23/09/11 04:30:42 INFO HoodieSparkSqlWriter$: Config.inlineCompactionEnabled 
> ? false
> 23/09/11 04:30:42 INFO HoodieSparkSqlWriter$: Compaction Scheduled is 
> Optional.empty
> 23/09/11 04:30:42 INFO HoodieSparkSqlWriter$: Config.asyncClusteringEnabled ? 
> false
> 23/09/11 04:30:42 INFO HoodieSparkSqlWriter$: Clustering Scheduled is 
> Optional.empty
> 23/09/11 04:30:42 INFO HiveConf: Found configuration file null
> [..]
> 23/09/11 04:30:42 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient 
> from gs:///
> 23/09/11 04:30:42 INFO HoodieTableConfig: Loading table properties from 
> gs:///.hoodie/hoodie.properties
> 23/09/11 04:30:42 INFO HoodieTableMetaClient: Finished Loading Table of type 
> COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from gs:///
> 23/09/11 04:30:42 INFO HoodieTableMetaClient: Loading Active commit timeline 
> for gs:///
> 23/09/11 04:30:42 INFO HoodieActiveTimeline: Loaded instants upto : 
> Option\{val=[20230911042953444__commit__COMPLETED]}
> 23/09/11 04:30:43 INFO HiveMetaStore: 0: Opening raw store with 
> implementation class:org.apache.hadoop.hive.metastore.ObjectStore
> 23/09/11 04:30:43 

[jira] [Updated] (HUDI-7346) Remove usage of org.apache.hadoop.hbase.util.Bytes

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7346:

Fix Version/s: 0.15.0

> Remove usage of org.apache.hadoop.hbase.util.Bytes
> --
>
> Key: HUDI-7346
> URL: https://issues.apache.org/jira/browse/HUDI-7346
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7340) Use spillable map for cached log records in HoodieBaseFileGroupRecordBuffer

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7340:

Fix Version/s: 0.15.0

> Use spillable map for cached log records in HoodieBaseFileGroupRecordBuffer
> ---
>
> Key: HUDI-7340
> URL: https://issues.apache.org/jira/browse/HUDI-7340
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: reader-core
>Reporter: Danny Chen
>Assignee: Lin Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7336) Introduce new HoodieStorage abstraction

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo closed HUDI-7336.
---
Resolution: Fixed

> Introduce new HoodieStorage abstraction
> ---
>
> Key: HUDI-7336
> URL: https://issues.apache.org/jira/browse/HUDI-7336
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7336) Introduce new HoodieStorage abstraction

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-7336:
---

Assignee: Ethan Guo

> Introduce new HoodieStorage abstraction
> ---
>
> Key: HUDI-7336
> URL: https://issues.apache.org/jira/browse/HUDI-7336
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7351) Hive-sync partition pushdown does not work with glue

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7351:

Fix Version/s: 0.15.0

> Hive-sync partition pushdown does not work with glue
> 
>
> Key: HUDI-7351
> URL: https://issues.apache.org/jira/browse/HUDI-7351
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: nicolas paris
>Assignee: nicolas paris
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> https://github.com/apache/hudi/issues/10569



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7336) Introduce new HoodieStorage abstraction

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7336:

Fix Version/s: 0.15.0
   1.0.0

> Introduce new HoodieStorage abstraction
> ---
>
> Key: HUDI-7336
> URL: https://issues.apache.org/jira/browse/HUDI-7336
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7342) Use BaseFileUtils to hide format-specific logic in HoodiePartitionMetadata

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7342:

Fix Version/s: 0.15.0

> Use BaseFileUtils to hide format-specific logic in HoodiePartitionMetadata
> --
>
> Key: HUDI-7342
> URL: https://issues.apache.org/jira/browse/HUDI-7342
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7327) hoodie.write.handle.missing.cols.with.lossless.type.promotion does not work with HoodieIncrSource unless meta cols are dropped

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7327:

Fix Version/s: 0.15.0
   1.0.0

> hoodie.write.handle.missing.cols.with.lossless.type.promotion does not work 
> with HoodieIncrSource unless meta cols are dropped
> --
>
> Key: HUDI-7327
> URL: https://issues.apache.org/jira/browse/HUDI-7327
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: deltastreamer
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> The incoming meta columns are treated as new columns, which is not allowed by 
> InternalSchema, so the job fails.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7323) Transformer schema inference uses stale schema

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7323:

Fix Version/s: 0.15.0
   1.0.0

> Transformer schema inference uses stale schema
> --
>
> Key: HUDI-7323
> URL: https://issues.apache.org/jira/browse/HUDI-7323
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Timothy Brown
>Assignee: Timothy Brown
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> The `transformedSchema` interface for the Transformer class should use an up 
> to date schema instead of the schema at the time of object creation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7323) Transformer schema inference uses stale schema

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo closed HUDI-7323.
---
Resolution: Fixed

> Transformer schema inference uses stale schema
> --
>
> Key: HUDI-7323
> URL: https://issues.apache.org/jira/browse/HUDI-7323
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Timothy Brown
>Assignee: Timothy Brown
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> The `transformedSchema` interface for the Transformer class should use an up 
> to date schema instead of the schema at the time of object creation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7335) Create hudi-hadoop-common for hadoop-specific implementation

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7335:

Fix Version/s: 0.15.0

> Create hudi-hadoop-common for hadoop-specific implementation
> 
>
> Key: HUDI-7335
> URL: https://issues.apache.org/jira/browse/HUDI-7335
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7228) Close LogFileReaders eaglerly with LogRecordReader

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo closed HUDI-7228.
---
Resolution: Fixed

> Close LogFileReaders eaglerly with LogRecordReader
> --
>
> Key: HUDI-7228
> URL: https://issues.apache.org/jira/browse/HUDI-7228
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: reader-core
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7316) AbstractHoodieLogRecordReader should accept already-constructed HoodieTableMetaClient in order to reduce occurrences of file listing calls when reloading active timeline

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7316:

Fix Version/s: 0.15.0

> AbstractHoodieLogRecordReader should accept already-constructed 
> HoodieTableMetaClient in order to reduce occurrences of file listing calls 
> when reloading active timeline
> -
>
> Key: HUDI-7316
> URL: https://issues.apache.org/jira/browse/HUDI-7316
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Krishen Bhan
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> Currently some implementors of AbstractHoodieLogRecordReader create a 
> HoodieTableMetaClient on construction, which implicitly reloads active 
> timeline, causing a {{listStatus}} HDFS call. Since when using Spark engine 
> these are created in Spark executors, a Spark user may have hundreds to 
> thousands of executors that will make a {{listStatus}} call at the same time 
> (during a Spark stage). To avoid these redundant calls to the HDFS NameNode 
> (or any distributed filesystem service in general), users of 
> AbstractHoodieLogRecordReader and implementations should pass in 
> already-constructed HoodieTableMetaClient.
> As long as the caller passed in a HoodieTableMetaClient with active timeline 
> already loaded, and the implementation doesn't need to re-load the timeline 
> (such as in order to get a more "fresh" timeline) then these calls will be 
> avoided in the executor, without causing the logic to be incorrect.
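A minimal sketch of the intent, assuming the 0.x builder API (exact setters may differ across versions): build the meta client once on the driver, with the timeline already loaded, and hand that instance to the log record readers instead of letting each executor construct its own.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hudi.common.table.HoodieTableMetaClient;

public class MetaClientReuseSketch {
  // Constructing the meta client lists .hoodie/ and loads the active timeline,
  // so doing it once and reusing the instance avoids a listStatus call per executor.
  public static HoodieTableMetaClient buildOnce(Configuration hadoopConf, String basePath) {
    return HoodieTableMetaClient.builder()
        .setConf(hadoopConf)
        .setBasePath(basePath)
        .build();
  }
}
{code}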



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7228) Close LogFileReaders eaglerly with LogRecordReader

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7228:

Fix Version/s: 0.15.0
   1.0.0

> Close LogFileReaders eaglerly with LogRecordReader
> --
>
> Key: HUDI-7228
> URL: https://issues.apache.org/jira/browse/HUDI-7228
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: reader-core
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7311) Comparing date with date literal in string format causes class cast exception during filter push down

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7311:

Fix Version/s: 0.15.0

> Comparing date with date literal in string format causes class cast exception 
> during filter push down
> -
>
> Key: HUDI-7311
> URL: https://issues.apache.org/jira/browse/HUDI-7311
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink
>Affects Versions: 0.14.0, 0.14.1
>Reporter: Yao Zhang
>Assignee: Yao Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> Given any table with arbitrary field typed date (e.g. field d_date with type 
> of date). And execute the SQL with conditions for this field in where clause.
> {code:sql}
> select d_date from xxx where d_date = '2020-01-01'
> {code}
> An exception will occur:
> {code:java}
> Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to 
> java.lang.Integer
> at 
> org.apache.hudi.source.ExpressionPredicates.toParquetPredicate(ExpressionPredicates.java:613)
> at 
> org.apache.hudi.source.ExpressionPredicates.access$100(ExpressionPredicates.java:64)
> at 
> org.apache.hudi.source.ExpressionPredicates$ColumnPredicate.filter(ExpressionPredicates.java:226)
> at 
> org.apache.hudi.table.format.RecordIterators.getParquetRecordIterator(RecordIterators.java:68)
> at 
> org.apache.hudi.table.format.cow.CopyOnWriteInputFormat.open(CopyOnWriteInputFormat.java:130)
> at 
> org.apache.hudi.table.format.cow.CopyOnWriteInputFormat.open(CopyOnWriteInputFormat.java:66)
> at 
> org.apache.flink.streaming.api.functions.source.InputFormatSourceFunction.run(InputFormatSourceFunction.java:84)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:110)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:67)
> at 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:333)
> {code}
> Hudi on Flink cannot convert the date literal in String form to Integer (the 
> primitive type backing DATE). However, the same SQL works well in Flink without Hudi.
> In summary, we should add automatic literal type conversion before filter push 
> down.
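A sketch of the conversion the ticket calls for, written directly against the Parquet FilterApi; this is not the actual Hudi change, and it assumes DATE values are stored as INT32 epoch days (which the reported exception confirms).

{code:java}
import java.time.LocalDate;
import org.apache.parquet.filter2.predicate.FilterApi;
import org.apache.parquet.filter2.predicate.FilterPredicate;

public class DateLiteralConversionSketch {
  // Convert the string literal to days-since-epoch before building the predicate,
  // so the value type matches the INT32 physical type of the DATE column.
  static FilterPredicate dateEquals(String columnName, String dateLiteral) {
    int epochDay = (int) LocalDate.parse(dateLiteral).toEpochDay();
    return FilterApi.eq(FilterApi.intColumn(columnName), epochDay);
  }
}
{code}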



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7298) Write bad records to error table in more cases instead of failing stream

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7298:

Fix Version/s: 0.15.0
   1.0.0

> Write bad records to error table in more cases instead of failing stream
> 
>
> Key: HUDI-7298
> URL: https://issues.apache.org/jira/browse/HUDI-7298
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: deltastreamer, spark
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> If no transformer is used but a schema provider is used, records with an 
> incorrect schema will not be detected and will fail the stream during 
> HoodieRecord creation. Additionally, during key generation the stream can 
> crash if required fields are null.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7315) Disable constructing NOT filter predicate when pushing down its wrapped filter unsupported, as its operand's primitive value is incomparable.

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7315:

Fix Version/s: 0.15.0

> Disable constructing NOT filter predicate when pushing down its wrapped 
> filter unsupported, as its operand's primitive value is incomparable.
> -
>
> Key: HUDI-7315
> URL: https://issues.apache.org/jira/browse/HUDI-7315
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink
>Affects Versions: 0.14.0, 0.14.1
> Environment: Flink 1.17.1
> Hudi 0.14.x
>Reporter: Yao Zhang
>Assignee: Yao Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> This issue extends HUDI-7309: the same risk exists for the NOT filter 
> predicate when the predicate it wraps does not support push-down 
> (e.g. an expression whose operand is typed Decimal).
> It is similar to the AND/OR filter issue in HUDI-7309. Though I have not 
> yet reproduced the NOT filter issue in practice, the risk is real. We should 
> fix it.
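A sketch of the guard the ticket describes, again against the Parquet FilterApi; names are illustrative and this is not the actual patch.

{code:java}
import org.apache.parquet.filter2.predicate.FilterApi;
import org.apache.parquet.filter2.predicate.FilterPredicate;

public class NotPredicateGuardSketch {
  // If the wrapped predicate could not be pushed down (converted to null),
  // the NOT wrapper must be dropped too instead of calling FilterApi.not(null).
  static FilterPredicate negateIfPushable(FilterPredicate child) {
    return child == null ? null : FilterApi.not(child);
  }
}
{code}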



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7237) Minor Improvements to Schema Handling in Delta Sync

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo closed HUDI-7237.
---
Resolution: Fixed

> Minor Improvements to Schema Handling in Delta Sync
> ---
>
> Key: HUDI-7237
> URL: https://issues.apache.org/jira/browse/HUDI-7237
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Timothy Brown
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> There are two minor items that we have run into running DeltaStreamer in 
> production.
> 1. The schema is fetched more often than it needs to be, which can put 
> unnecessary load on schema providers or increase file system reads (see the sketch after this list).
> 2. SchemaProviders that return null target schemas on empty batches cause 
> null schema values in commits, leading to unexpected issues later.
>  
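For item 1, the usual remedy is to memoize the schema lookup per sync round rather than calling the provider repeatedly; a generic sketch follows (names illustrative, not the DeltaStreamer code).

{code:java}
import java.util.function.Supplier;

public class CachedSchemaSupplier<T> implements Supplier<T> {
  // Wraps an expensive schema lookup (schema registry call, file system read)
  // and performs it at most once, returning the cached value afterwards.
  private final Supplier<T> delegate;
  private T cached;

  public CachedSchemaSupplier(Supplier<T> delegate) {
    this.delegate = delegate;
  }

  @Override
  public synchronized T get() {
    if (cached == null) {
      cached = delegate.get();
    }
    return cached;
  }
}
{code}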



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7237) Minor Improvements to Schema Handling in Delta Sync

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7237:

Fix Version/s: 0.15.0
   1.0.0

> Minor Improvements to Schema Handling in Delta Sync
> ---
>
> Key: HUDI-7237
> URL: https://issues.apache.org/jira/browse/HUDI-7237
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Timothy Brown
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> There are two minor items that we have run into running DeltaStreamer in 
> production.
> 1. The schema is fetched more often than it needs to be, which can put 
> unnecessary load on schema providers or increase file system reads.
> 2. SchemaProviders that return null target schemas on empty batches cause 
> null schema values in commits, leading to unexpected issues later.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7309) Disable filter pushing down when the parquet type corresponding to its field logical type is not comparable

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7309:

Fix Version/s: 0.15.0

> Disable filter pushing down when the parquet type corresponding to its field 
> logical type is not comparable
> ---
>
> Key: HUDI-7309
> URL: https://issues.apache.org/jira/browse/HUDI-7309
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink
>Affects Versions: 0.14.0, 0.14.1
> Environment: Hudi 0.14.0
> Hudi 0.14.1rc1
> Flink 1.17.1
>Reporter: Yao Zhang
>Assignee: Yao Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> Given the table web_sales from TPCDS:
> {code:sql}
> CREATE TABLE web_sales (
>ws_sold_date_sk int,
>ws_sold_time_sk int,
>ws_ship_date_sk int,
>ws_item_sk int,
>ws_bill_customer_sk int,
>ws_bill_cdemo_sk int,
>ws_bill_hdemo_sk int,
>ws_bill_addr_sk int,
>ws_ship_customer_sk int,
>ws_ship_cdemo_sk int,
>ws_ship_hdemo_sk int,
>ws_ship_addr_sk int,
>ws_web_page_sk int,
>ws_web_site_sk int,
>ws_ship_mode_sk int,
>ws_warehouse_sk int,
>ws_promo_sk int,
>ws_order_number int,
>ws_quantity int,
>ws_wholesale_cost decimal(7,2),
>ws_list_price decimal(7,2),
>ws_sales_price decimal(7,2),
>ws_ext_discount_amt decimal(7,2),
>ws_ext_sales_price decimal(7,2),
>ws_ext_wholesale_cost decimal(7,2),
>ws_ext_list_price decimal(7,2),
>ws_ext_tax decimal(7,2),
>ws_coupon_amt decimal(7,2),
>ws_ext_ship_cost decimal(7,2),
>ws_net_paid decimal(7,2),
>ws_net_paid_inc_tax decimal(7,2),
>ws_net_paid_inc_ship decimal(7,2),
>ws_net_paid_inc_ship_tax decimal(7,2),
>ws_net_profit decimal(7,2)
> ) with (
> 'connector' = 'hudi',
> 'path' = 'hdfs://path/to/web_sales',
> 'table.type' = 'COPY_ON_WRITE',
> 'hoodie.datasource.write.recordkey.field' = 
> 'ws_item_sk,ws_order_number'
>   );
> {code}
> And execute:
> {code:sql}
> select * from web_sales where ws_sold_date_sk = 2451268 and ws_sales_price 
> between 100.00 and 150.00
> {code}
> An exception will occur:
> {code:java}
> Caused by: java.lang.NullPointerException: left cannot be null
> at java.util.Objects.requireNonNull(Objects.java:228)
> at 
> org.apache.parquet.filter2.predicate.Operators$BinaryLogicalFilterPredicate.(Operators.java:257)
> at 
> org.apache.parquet.filter2.predicate.Operators$And.(Operators.java:301)
> at 
> org.apache.parquet.filter2.predicate.FilterApi.and(FilterApi.java:249)
> at 
> org.apache.hudi.source.ExpressionPredicates$And.filter(ExpressionPredicates.java:551)
> at 
> org.apache.hudi.source.ExpressionPredicates$Or.filter(ExpressionPredicates.java:589)
> at 
> org.apache.hudi.source.ExpressionPredicates$Or.filter(ExpressionPredicates.java:589)
> at 
> org.apache.hudi.table.format.RecordIterators.getParquetRecordIterator(RecordIterators.java:68)
> at 
> org.apache.hudi.table.format.cow.CopyOnWriteInputFormat.open(CopyOnWriteInputFormat.java:130)
> at 
> org.apache.hudi.table.format.cow.CopyOnWriteInputFormat.open(CopyOnWriteInputFormat.java:66)
> at 
> org.apache.flink.streaming.api.functions.source.InputFormatSourceFunction.run(InputFormatSourceFunction.java:84)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:110)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:67)
> at 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:333)
> {code}
> After further investigation, the decimal type is not comparable in the form in 
> which it is stored in Parquet (a fixed-length byte array). The way this filter 
> is pushed down to Parquet predicates is not 
> 
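The NullPointerException above comes from handing a null child predicate to Parquet's FilterApi.and. A hedged sketch of combining child predicates defensively, so a predicate that could not be converted (for example one on a decimal stored as a fixed-length byte array) is dropped instead of propagated as null; `combineAnd` is an illustrative helper, not the actual Hudi fix.

{code:scala}
import org.apache.parquet.filter2.predicate.{FilterApi, FilterPredicate}

object PredicateCombineSketch {
  // FilterApi.and throws a NullPointerException if either side is null,
  // so combine only the children that were successfully converted.
  def combineAnd(left: Option[FilterPredicate],
                 right: Option[FilterPredicate]): Option[FilterPredicate] =
    (left, right) match {
      case (Some(l), Some(r)) => Some(FilterApi.and(l, r))
      case (Some(l), None)    => Some(l)
      case (None, Some(r))    => Some(r)
      case (None, None)       => None
    }
}
{code}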

[jira] [Updated] (HUDI-7303) Date field type unexpectedly convert to Long when using date comparison operator

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7303:

Fix Version/s: 0.15.0

> Date field type unexpectedly convert to Long when using date comparison 
> operator
> 
>
> Key: HUDI-7303
> URL: https://issues.apache.org/jira/browse/HUDI-7303
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink
>Affects Versions: 0.14.0, 0.14.1
> Environment: Flink 1.15.4 Hudi 0.14.0
> Flink 1.17.1 Hudi 0.14.0
> Flink 1.17.1 Hudi 0.14.1rc1
>Reporter: Yao Zhang
>Assignee: Yao Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> Given the table date_dim from TPCDS as an example:
> {code:java}
> CREATE TABLE date_dim (
>   d_date_sk int,
>   d_date_id varchar(16) NOT NULL,
>   d_date date,
>   d_month_seq int,
>   d_week_seq int,
>   d_quarter_seq int,
>   d_year int,
>   d_dow int,
>   d_moy int,
>   d_dom int,
>   d_qoy int,
>   d_fy_year int, 
>   d_fy_quarter_seq int,
>   d_fy_week_seq int,
>   d_day_name varchar(9)
>   d_quarter_name varchar(6),
>   d_holiday char(1),
>   d_weekend char(1),
>   d_following_holiday char(1),
>   d_first_dom int,
>   d_last_dom int,
>   d_same_day_ly int,
>   d_same_day_lq int,
>   d_current_day char(1),
>   d_current_week char(1),
>   d_current_month char(1),
>   d_current_quarter char(1),
>   d_current_year char(1)) with (
>   'connector' = 'hudi',
>   'path' = 'hdfs:///table_path/date_dim',
>   'table.type' = 'COPY_ON_WRITE'); {code}
> When you execute the following select statement, an exception will be thrown:
> {code:java}
> select * from date_dim where d_date between cast('1999-02-22' as date) and 
> (cast('1999-02-22' as date) + INTERVAL '30' day);
> {code}
> The exception is:
> {code:java}
> java.lang.IllegalArgumentException: FilterPredicate column: d_date's declared 
> type (java.lang.Long) does not match the schema found in file metadata. 
> Column d_date is of type: INT32
> Valid types for this column are: [class java.lang.Integer]
>   at 
> org.apache.parquet.filter2.predicate.ValidTypeMap.assertTypeValid(ValidTypeMap.java:125)
>  ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:179)
>  ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:149)
>  ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:113)
>  ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:56)
>  ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
>   at 
> org.apache.parquet.filter2.predicate.Operators$GtEq.accept(Operators.java:246)
>  ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:119)
>  ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:56)
>  ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
>   at 
> org.apache.parquet.filter2.predicate.Operators$And.accept(Operators.java:306) 
> ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:61)
>  ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:95)
>  ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:45)
>  ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
>   at 
> org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:149)
>  ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:67)
>  ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
>   at 
> org.apache.hudi.table.format.cow.vector.reader.ParquetColumnarRowSplitReader.(ParquetColumnarRowSplitReader.java:142)
>  ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
>   at 
> org.apache.hudi.table.format.cow.ParquetSplitReaderUtil.genPartColumnarRowReader(ParquetSplitReaderUtil.java:153)
>  ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
>   at 
> org.apache.hudi.table.format.RecordIterators.getParquetRecordIterator(RecordIterators.java:78)
>  
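The mismatch above is between a predicate built with java.lang.Long and a DATE column that Parquet stores as INT32 (days since the Unix epoch). A minimal sketch of building the predicate with an Integer value instead; the column name d_date comes from the example table, and the helper itself is illustrative, not the Hudi code path.

{code:scala}
import java.time.LocalDate

import org.apache.parquet.filter2.predicate.{FilterApi, FilterPredicate}

object DatePredicateSketch {
  // DATE values are stored as INT32 days since 1970-01-01, so the predicate
  // value must be an Integer rather than a Long.
  def dateGtEq(columnName: String, date: LocalDate): FilterPredicate =
    FilterApi.gtEq(FilterApi.intColumn(columnName),
      java.lang.Integer.valueOf(date.toEpochDay.toInt))

  def main(args: Array[String]): Unit =
    println(dateGtEq("d_date", LocalDate.parse("1999-02-22")))
}
{code}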

[jira] [Updated] (HUDI-7277) hoodie.bulkinsert.shuffle.parallelism not activated with non-partitioned table

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7277:

Fix Version/s: 0.15.0

> hoodie.bulkinsert.shuffle.parallelism not activated with non-partitioned table
> -
>
> Key: HUDI-7277
> URL: https://issues.apache.org/jira/browse/HUDI-7277
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Reporter: KnightChess
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> [https://github.com/apache/hudi/issues/10418#issuecomment-1880454835]
> If a non-partitioned table uses bulk insert, 
> `hoodie.bulkinsert.shuffle.parallelism` cannot control the shuffle 
> parallelism, because no sort is applied.
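A hedged sketch of what honoring the configured parallelism could look like when no sort mode is applied: simply repartition the incoming rows to the requested value before writing. The helper name is illustrative and this is not Hudi's actual writer path.

{code:scala}
import org.apache.spark.sql.DataFrame

object BulkInsertRepartitionSketch {
  // Illustrative: with no sort applied, enforce the requested write parallelism
  // by repartitioning the input before it is handed to the writer.
  def applyBulkInsertParallelism(df: DataFrame, parallelism: Int): DataFrame =
    if (parallelism > 0) df.repartition(parallelism) else df
}
{code}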



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7317) FlinkTableFactory sanityCheck should contain index type

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7317:

Fix Version/s: 0.15.0

> FlinkTableFactory sanityCheck should contain index type
> -
>
> Key: HUDI-7317
> URL: https://issues.apache.org/jira/browse/HUDI-7317
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: flink-sql
>Reporter: xy
>Assignee: xy
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> FlinkTableFactory sanityCheck should also validate the index type.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7314) Hudi Create table support index type check

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7314:

Fix Version/s: 0.15.0, 1.0.0

> Hudi Create table support index type check
> --
>
> Key: HUDI-7314
> URL: https://issues.apache.org/jira/browse/HUDI-7314
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql
>Reporter: xy
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> Currently Hudi does not check the index type when creating a table, even when 
> the user passes an inaccurate or nonexistent index type name. This needs to be fixed.
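A minimal sketch of the kind of validation described: reject unknown index type names at create-table time. The allowed set shown here is an illustrative subset (the real list comes from Hudi's index type enum), and the helper name is an assumption.

{code:scala}
object IndexTypeValidationSketch {
  // Illustrative subset of index type names; not the authoritative list.
  private val knownIndexTypes = Set("BLOOM", "GLOBAL_BLOOM", "SIMPLE", "GLOBAL_SIMPLE", "BUCKET")

  def validateIndexType(indexType: String): Unit = {
    val normalized = indexType.trim.toUpperCase
    require(knownIndexTypes.contains(normalized),
      s"Unknown index type '$indexType'; expected one of ${knownIndexTypes.mkString(", ")}")
  }
}
{code}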



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7314) Hudi Create table support index type check

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo closed HUDI-7314.
---
Resolution: Fixed

> Hudi Create table support index type check
> --
>
> Key: HUDI-7314
> URL: https://issues.apache.org/jira/browse/HUDI-7314
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql
>Reporter: xy
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> Currently Hudi does not check the index type when creating a table, even when 
> the user passes an inaccurate or nonexistent index type name. This needs to be fixed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7297) Exception thrown when field type mismatch is ambiguous

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7297:

Fix Version/s: 0.15.0

> Exception thrown when field type mismatch is ambiguous
> --
>
> Key: HUDI-7297
> URL: https://issues.apache.org/jira/browse/HUDI-7297
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Yao Zhang
>Assignee: Yao Zhang
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> If you create a table with mismatched field types in Flink SQL, for example 
> you define a field as bigint while the actual field type is int, an 
> IllegalArgumentException is thrown, like the one below:
> java.lang.IllegalArgumentException: Unexpected type: INT32
> The exception is way too ambiguous. It is difficult to figure out which field 
> type is incorrect and what the correct type is. You have to refer to the 
> source code.
> Currently I plan to make the exception message more informative. 
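A sketch of the kind of message the reporter describes: name the offending field and include both the declared type and the Parquet type actually found, instead of only "Unexpected type: INT32". The method and its parameters are illustrative, not the actual patch.

{code:scala}
object TypeMismatchMessageSketch {
  // Illustrative: build an error message that names the field and both types,
  // so the user can fix the table definition without reading source code.
  def unexpectedType(fieldName: String, declaredType: String, parquetType: String): IllegalArgumentException =
    new IllegalArgumentException(
      s"Unexpected type for field '$fieldName': declared as $declaredType in the Flink schema " +
        s"but stored as $parquetType in the Parquet file. Please align the table definition with the data.")
}
{code}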



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7305) Fix cast exception while reading byte/short/float type of partitioned field

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7305:

Fix Version/s: 0.15.0

> Fix cast exception while reading byte/short/float type of partitioned field
> ---
>
> Key: HUDI-7305
> URL: https://issues.apache.org/jira/browse/HUDI-7305
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Qijun Fu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> Fix cast exception while reading byte/short/float type of partitioned field



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7170) Implement HFile reader independent of HBase

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7170:

Fix Version/s: 0.15.0

> Implement HFile reader independent of HBase
> ---
>
> Key: HUDI-7170
> URL: https://issues.apache.org/jira/browse/HUDI-7170
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> We'd like to provide our own implementation of the HFile reader which does not 
> use HBase dependencies. In the long term, we should also decouple the HFile 
> reader from the Hadoop FileSystem abstractions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7246) Data Skipping Issue: No Results When Query Conditions Involve Both Columns with and without Column Stats

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo closed HUDI-7246.
---
Resolution: Fixed

> Data Skipping Issue: No Results When Query Conditions Involve Both Columns 
> with and without Column Stats
> 
>
> Key: HUDI-7246
> URL: https://issues.apache.org/jira/browse/HUDI-7246
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ma Jian
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> In the current code version, support for column stats has not yet been 
> extended to handle complex nested data types, such as map-type data 
> structures. Take the table tbl as an example, which is defined with three 
> fields: an integer field id, a string field name, and a map-type field 
> attributes. Within this table structure, the id and name fields support 
> column stats, and as such, HUDI will generate the corresponding column stats 
> indices for these two fields at the time of table creation. However, no 
> corresponding index will be generated for the attributes field. The specific 
> table creation statement is as follows:
> create table tbl (
>   id int,
>   name string,
>   attributes map<string, string>
> ) ...
> To elaborate further, consider the following insert operation:
> insert into tbl values
> (1, 'a1', map('color', 'red', 'size', 'M')),
> (2, 'a2', map('color', 'blue', 'size', 'L'));
> After the execution of the insert, the content of the column stats should be as follows:
> a.parquet   id     min: 1    max: 1    null: 0
> b.parquet   id     min: 2    max: 2    null: 0
> a.parquet   name   min: 'a1' max: 'a1' null: 0
> b.parquet   name   min: 'a2' max: 'a2' null: 0
> This means that there is no column stats index for the attributes column.
> Based on the table tbl, when we execute a query:
> h3. 1.Queries containing only columns supported by column stats:
> At this point, the data skipping code looks like this:
> columnStatsIndex.loadTransposed(queryReferencedColumns, shouldReadInMemory) 
> \{ transposedColStatsDF =>
>   Some(getCandidateFiles(transposedColStatsDF, queryFilters))
> }
> The content of queryReferencedColumns is name, and the content of 
> queryFilters is isnotnull(name#94) and (name#94 = a1). The 
> transposedColStatsDF is then based on the queryReferencedColumns to select 
> the corresponding column stats:
> +--------------------+----------+-------------+-------------+--------------+
> |            fileName|valueCount|name_minValue|name_maxValue|name_nullCount|
> +--------------------+----------+-------------+-------------+--------------+
> |688f0d1e-527b-480...|         1|           a1|           a1|             0|
> |    0851-faa4-48e...|         1|           a2|           a2|             0|
> +--------------------+----------+-------------+-------------+--------------+
> Inside the getCandidateFiles function, indexSchema and indexFilter are 
> similar to the two parameters above, with the main difference being that in 
> indexFilter, isnotnull(name#94) is converted into ('name_nullCount < 
> 'valueCount). This judges that the name column is not null based on the 
> number of non-nulls being less than the total count. Thus, 
> prunedCandidateFileNames can correctly filter out the required files.
> private def getCandidateFiles(indexDf: DataFrame, queryFilters: 
> Seq[Expression]): Set[String] = \{
>   val indexSchema = indexDf.schema
>   val indexFilter = 
> queryFilters.map(translateIntoColumnStatsIndexFilterExpr(_, 
> indexSchema)).reduce(And)
>   val prunedCandidateFileNames =
> indexDf.where(new Column(indexFilter))
>   .select(HoodieMetadataPayload.COLUMN_STATS_FIELD_FILE_NAME)
>   .collect()
>   .map(_.getString(0))
>   .toSet
>   val allIndexedFileNames =
> indexDf.select(HoodieMetadataPayload.COLUMN_STATS_FIELD_FILE_NAME)
>   .collect()
>   .map(_.getString(0))
>   .toSet
>   val notIndexedFileNames = 
> lookupFileNamesMissingFromIndex(allIndexedFileNames)
>   prunedCandidateFileNames ++ notIndexedFileNames
> }
> h3. 2. Queries containing columns not supported by column stats
> Suppose our query is adjusted to:
> select * from tbl where attributes.color = 'red'
> At this time, queryReferencedColumns is attributes, and queryFilters are 
> isnotnull(attributes#95) and (attributes#95[color] = red). Since 
> transposedColStatsDF does not have column stats for this column, it will be 
> empty. No matter what the query conditions are, prunedCandidateFileNames and 
> allIndexedFileNames in getCandidateFiles will be empty, hitting the logic of 
> notIndexedFileNames and returning all files. Thus, the query is still correct.
> h3. 3.Queries containing both columns supported and not supported by column 
> 

[jira] [Updated] (HUDI-7246) Data Skipping Issue: No Results When Query Conditions Involve Both Columns with and without Column Stats

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7246:

Fix Version/s: 0.15.0, 1.0.0

> Data Skipping Issue: No Results When Query Conditions Involve Both Columns 
> with and without Column Stats
> 
>
> Key: HUDI-7246
> URL: https://issues.apache.org/jira/browse/HUDI-7246
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ma Jian
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> In the current code version, support for column stats has not yet been 
> extended to handle complex nested data types, such as map-type data 
> structures. Take the table tbl as an example, which is defined with three 
> fields: an integer field id, a string field name, and a map-type field 
> attributes. Within this table structure, the id and name fields support 
> column stats, and as such, HUDI will generate the corresponding column stats 
> indices for these two fields at the time of table creation. However, no 
> corresponding index will be generated for the attributes field. The specific 
> table creation statement is as follows:
> create table tbl (
>   id int,
>   name string,
>   attributes map<string, string>
> ) ...
> To elaborate further, consider the following insert operation:
> insert into tbl values
> (1, 'a1', map('color', 'red', 'size', 'M')),
> (2, 'a2', map('color', 'blue', 'size', 'L'));
> After the execution of the insert, the content of the column stats should be as follows:
> a.parquet   id     min: 1    max: 1    null: 0
> b.parquet   id     min: 2    max: 2    null: 0
> a.parquet   name   min: 'a1' max: 'a1' null: 0
> b.parquet   name   min: 'a2' max: 'a2' null: 0
> This means that there is no column stats index for the attributes column.
> Based on the table tbl, when we execute a query:
> h3. 1.Queries containing only columns supported by column stats:
> At this point, the data skipping code looks like this:
> columnStatsIndex.loadTransposed(queryReferencedColumns, shouldReadInMemory) 
> \{ transposedColStatsDF =>
>   Some(getCandidateFiles(transposedColStatsDF, queryFilters))
> }
> The content of queryReferencedColumns is name, and the content of 
> queryFilters is isnotnull(name#94) and (name#94 = a1). The 
> transposedColStatsDF is then based on the queryReferencedColumns to select 
> the corresponding column stats:
> +--------------------+----------+-------------+-------------+--------------+
> |            fileName|valueCount|name_minValue|name_maxValue|name_nullCount|
> +--------------------+----------+-------------+-------------+--------------+
> |688f0d1e-527b-480...|         1|           a1|           a1|             0|
> |    0851-faa4-48e...|         1|           a2|           a2|             0|
> +--------------------+----------+-------------+-------------+--------------+
> Inside the getCandidateFiles function, indexSchema and indexFilter are 
> similar to the two parameters above, with the main difference being that in 
> indexFilter, isnotnull(name#94) is converted into ('name_nullCount < 
> 'valueCount). This judges that the name column is not null based on the 
> number of non-nulls being less than the total count. Thus, 
> prunedCandidateFileNames can correctly filter out the required files.
> private def getCandidateFiles(indexDf: DataFrame, queryFilters: 
> Seq[Expression]): Set[String] = \{
>   val indexSchema = indexDf.schema
>   val indexFilter = 
> queryFilters.map(translateIntoColumnStatsIndexFilterExpr(_, 
> indexSchema)).reduce(And)
>   val prunedCandidateFileNames =
> indexDf.where(new Column(indexFilter))
>   .select(HoodieMetadataPayload.COLUMN_STATS_FIELD_FILE_NAME)
>   .collect()
>   .map(_.getString(0))
>   .toSet
>   val allIndexedFileNames =
> indexDf.select(HoodieMetadataPayload.COLUMN_STATS_FIELD_FILE_NAME)
>   .collect()
>   .map(_.getString(0))
>   .toSet
>   val notIndexedFileNames = 
> lookupFileNamesMissingFromIndex(allIndexedFileNames)
>   prunedCandidateFileNames ++ notIndexedFileNames
> }
> h3. 2. Queries containing columns not supported by column stats
> Suppose our query is adjusted to:
> select * from tbl where attributes.color = 'red'
> At this time, queryReferencedColumns is attributes, and queryFilters are 
> isnotnull(attributes#95) and (attributes#95[color] = red). Since 
> transposedColStatsDF does not have column stats for this column, it will be 
> empty. No matter what the query conditions are, prunedCandidateFileNames and 
> allIndexedFileNames in getCandidateFiles will be empty, hitting the logic of 
> notIndexedFileNames and returning all files. Thus, the query is still correct.
> h3. 3.Queries containing both columns supported 

[jira] [Updated] (HUDI-7296) Reduce combinations for some tests to make ci faster

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7296:

Fix Version/s: 0.15.0, 1.0.0

> Reduce combinations for some tests to make ci faster
> 
>
> Key: HUDI-7296
> URL: https://issues.apache.org/jira/browse/HUDI-7296
> Project: Apache Hudi
>  Issue Type: Test
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> testBootstrapRead and TestHoodieDeltaStreamerSchemaEvolutionQuick have many 
> combinations of params. While it is good to test everything, there are lots 
> of code paths that have extensive duplicate testing. Reduce the number of 
> tests while still maintaining code coverage.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7300) Parquet DFS source should support merging schemas

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7300:

Fix Version/s: 0.15.0, 1.0.0

> Parquet DFS source should support merging schemas
> -
>
> Key: HUDI-7300
> URL: https://issues.apache.org/jira/browse/HUDI-7300
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Rohit Mittapalli
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> We should surface the option to merge schemas across the Parquet files in a 
> single commit when using ParquetDFSSource.
>  
> When false, the schema is picked from an arbitrary Parquet file (current 
> behavior). When set to true, the schemas across a commit are merged.
>  
> https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#schema-merging
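For reference, this mirrors Spark's own schema-merging option for the Parquet source. A minimal sketch of reading a batch of files with merged schemas; the path is a placeholder.

{code:scala}
import org.apache.spark.sql.SparkSession

object MergeSchemaReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("merge-schema-sketch").master("local[*]").getOrCreate()
    // mergeSchema = true asks Spark to union the schemas of all Parquet files in
    // the batch instead of taking the schema of an arbitrary file.
    val df = spark.read
      .option("mergeSchema", "true")
      .parquet("/path/to/commit/files") // placeholder path
    df.printSchema()
    spark.stop()
  }
}
{code}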



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7300) Parquet DFS source should support merging schemas

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7300:

Status: Patch Available  (was: In Progress)

> Parquet DFS source should support merging schemas
> -
>
> Key: HUDI-7300
> URL: https://issues.apache.org/jira/browse/HUDI-7300
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Rohit Mittapalli
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> We should surface the option to merge schemas across the Parquet files in a 
> single commit when using ParquetDFSSource.
>  
> When false, the schema is picked from an arbitrary Parquet file (current 
> behavior). When set to true, the schemas across a commit are merged.
>  
> https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#schema-merging



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7300) Parquet DFS source should support merging schemas

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo closed HUDI-7300.
---
Resolution: Fixed

> Parquet DFS source should support merging schemas
> -
>
> Key: HUDI-7300
> URL: https://issues.apache.org/jira/browse/HUDI-7300
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Rohit Mittapalli
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> We should surface the option to merge schemas across the Parquet files in a 
> single commit when using ParquetDFSSource.
>  
> When false, the schema is picked from an arbitrary Parquet file (current 
> behavior). When set to true, the schemas across a commit are merged.
>  
> https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#schema-merging



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6902) Detect and fix flaky tests

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6902:

Fix Version/s: 0.15.0, 1.0.0

> Detect and fix flaky tests
> --
>
> Key: HUDI-6902
> URL: https://issues.apache.org/jira/browse/HUDI-6902
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Lin Liu
>Assignee: Lin Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> Step 1: Create a dummy PR and try to trigger the errors if possible.
> 1. The integration test constantly fails.
> 2. Some random failures: 
> [https://github.com/apache/hudi/actions/runs/6396038672]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7288) Fix ArrayIndexOutOfBoundsException when upgrade nonPartitionedTable created by 0.10/0.11 HUDI version

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo closed HUDI-7288.
---
Resolution: Fixed

> Fix ArrayIndexOutOfBoundsException when upgrade nonPartitionedTable created 
> by  0.10/0.11 HUDI version
> --
>
> Key: HUDI-7288
> URL: https://issues.apache.org/jira/browse/HUDI-7288
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Jing Zhang
>Assignee: Jing Zhang
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
> Attachments: image-2024-01-10-18-56-19-523.png, 
> image-2024-01-10-18-56-47-184.png
>
>
> When upgrading a non-partitioned table created by Hudi 0.10/0.11, an 
> ArrayIndexOutOfBoundsException is thrown, because hoodie.table.partition.fields 
> is empty in hoodie.properties. Note that empty configs have been filtered out of 
> the hoodie.properties file since version 0.12 by 
> [PR#6821|https://github.com/apache/hudi/pull/6821].
>  !image-2024-01-10-18-56-47-184.png! 
>  !image-2024-01-10-18-56-19-523.png! 
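A small sketch of the defensive handling described above: treat an empty hoodie.table.partition.fields value as "no partition fields" instead of splitting it and indexing into the result. The helper is illustrative only, not the actual upgrade code.

{code:scala}
object PartitionFieldsSketch {
  // Illustrative: an empty or blank property value means a non-partitioned table,
  // so return an empty array instead of attempting to split and index into it.
  def parsePartitionFields(prop: String): Array[String] =
    Option(prop).map(_.trim).filter(_.nonEmpty)
      .map(_.split(",").map(_.trim).filter(_.nonEmpty))
      .getOrElse(Array.empty[String])
}
{code}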



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7293) Incremental read of insert table using rebalance strategy

2024-06-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7293:

Fix Version/s: 0.15.0

> Incremental read of insert table using rebalance strategy
> -
>
> Key: HUDI-7293
> URL: https://issues.apache.org/jira/browse/HUDI-7293
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: 陈磊
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
> Attachments: image-2024-01-11-22-47-03-463.png, 
> image-2024-01-11-22-50-09-512.png
>
>
> For insert-type tables, we do not need to use keyBy to distribute input splits, 
> which avoids data skew issues in the split reader operator.
> !image-2024-01-11-22-50-09-512.png|width=606,height=197!
>  
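A tiny sketch of the distribution change the reporter suggests: round-robin the input splits to the split readers with Flink's rebalance() instead of hashing them with keyBy(). The element type and pipeline here are stand-ins, not Hudi's actual source operators.

{code:scala}
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment

object RebalanceSplitsSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Stand-in for the stream of input splits emitted by the split enumerator.
    val splits = env.fromElements("split-1", "split-2", "split-3", "split-4")
    // rebalance() distributes splits round-robin across all reader subtasks,
    // avoiding the skew that a keyBy-based distribution can introduce.
    splits.rebalance().print()
    env.execute("rebalance-splits-sketch")
  }
}
{code}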



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

