[PR] [MINOR] add logger to CompactionPlanOperator & ClusteringPlanOperator [hudi]

2024-01-24 Thread via GitHub


eric9204 opened a new pull request, #10562:
URL: https://github.com/apache/hudi/pull/10562

   ### Change Logs
   
   None
   
   ### Impact
   
   None
   
   ### Risk level (write none, low medium or high below)
   
   None
   
   ### Documentation Update
   
   None
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7218] Integrate new HFile reader with file reader factory [hudi]

2024-01-24 Thread via GitHub


nsivabalan commented on code in PR #10330:
URL: https://github.com/apache/hudi/pull/10330#discussion_r1465901347


##
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieReaderConfig.java:
##
@@ -29,6 +29,13 @@
 groupName = ConfigGroups.Names.READER,
 description = "Configurations that control file group reading.")
 public class HoodieReaderConfig extends HoodieConfig {
+  public static final ConfigProperty USE_BUILT_IN_HFILE_READER = 
ConfigProperty
+  .key("hoodie.hfile.use.built.in.reader")

Review Comment:
   Is this meant to be deprecated in the near future, and not really expected to 
be used by end users? If so, should we consider prefixing it with "_"?



##
hudi-common/src/main/java/org/apache/hudi/common/bloom/HoodieDynamicBoundedBloomFilter.java:
##
@@ -64,14 +66,17 @@ public class HoodieDynamicBoundedBloomFilter implements 
BloomFilter {
   public HoodieDynamicBoundedBloomFilter(String serString) {
 // ignoring the type code for now, since we have just one version
 byte[] bytes = Base64CodecUtil.decode(serString);
-DataInputStream dis = new DataInputStream(new ByteArrayInputStream(bytes));
-try {
-  internalDynamicBloomFilter = new InternalDynamicBloomFilter();
-  internalDynamicBloomFilter.readFields(dis);
-  dis.close();
-} catch (IOException e) {
-  throw new HoodieIndexException("Could not deserialize BloomFilter 
instance", e);
-}
+extractAndSetInternalBloomFilter(new DataInputStream(new 
ByteArrayInputStream(bytes)));

Review Comment:
   Is it possible to use a try-with-resources design here? 
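
A minimal sketch of the try-with-resources shape the comment asks about. The class and method names here (`TryWithResourcesSketch`, `readState`) are hypothetical stand-ins, not Hudi code; a single `readInt()` stands in for the filter's `readFields(dis)` call, and `UncheckedIOException` stands in for `HoodieIndexException`.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class TryWithResourcesSketch {
  // Hypothetical stand-in for deserializing filter state from a byte array.
  static int readState(byte[] bytes) {
    // The stream is closed automatically when the block exits,
    // whether readInt() returns normally or throws.
    try (DataInputStream dis = new DataInputStream(new ByteArrayInputStream(bytes))) {
      return dis.readInt();
    } catch (IOException e) {
      throw new UncheckedIOException("Could not deserialize state", e);
    }
  }

  public static void main(String[] args) {
    // Big-endian encoding of the int 42.
    System.out.println(readState(new byte[] {0, 0, 0, 42}));
  }
}
```

Compared with an explicit `dis.close()` inside the `try`, this form also closes the stream on the exception path.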



##
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieHFileDataBlock.java:
##
@@ -211,9 +233,10 @@ protected  ClosableIterator> 
lookupRecords(List sorte
 blockContentLoc.getContentPositionInLogFile(),
 blockContentLoc.getBlockSize());
 
-try (final HoodieAvroHFileReader reader =
- new HoodieAvroHFileReader(inlineConf, inlinePath, new 
CacheConfig(inlineConf), inlinePath.getFileSystem(inlineConf),
- Option.of(getSchemaFromHeader( {
+try (final BaseHoodieAvroHFileReader reader = (BaseHoodieAvroHFileReader)

Review Comment:
   Can we try to think of better names for HoodieAvroFileReaderBase and 
BaseHoodieAvroHFileReader?
   Do you think we can rename BaseHoodieAvroHFileReader to 
HoodieAvroHFileReaderImplBase or HoodieAvroHFileReaderBaseImpl? 
   
   



##
hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieAvroHFileReader.java:
##
@@ -728,42 +464,100 @@ public IndexedRecord next() {
 @Override
 public void close() {
   try {
-scanner.close();
 reader.close();
   } catch (IOException e) {
-throw new HoodieIOException("Error closing the hfile reader and 
scanner", e);
+throw new HoodieIOException("Error closing the HFile reader and 
scanner", e);
   }
 }
-  }
 
-  static class SeekableByteArrayInputStream extends 
ByteBufferBackedInputStream implements Seekable, PositionedReadable {
-public SeekableByteArrayInputStream(byte[] buf) {
-  super(buf);
-}
+private static Iterator 
getRecordByKeyPrefixIteratorInternal(HFileReader reader,

Review Comment:
   I see a lot of code duplication across HoodieAvroHBaseHFileReader and 
HoodieAvroHFileReader, e.g. RecordByKeyPrefixIterator and RecordByKeyIterator. 
Can we try to fix that and reuse the code? 



##
hudi-common/src/main/java/org/apache/hudi/common/bloom/SimpleBloomFilter.java:
##
@@ -138,4 +144,12 @@ public BloomFilterTypeCode getBloomFilterTypeCode() {
 return BloomFilterTypeCode.SIMPLE;
   }
 
+  private void extractAndSetInternalBloomFilter(DataInputStream dis) {
+try {
+  this.filter.readFields(dis);
+  dis.close();

Review Comment:
   same comment as above 



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java:
##
@@ -182,7 +183,7 @@ public static List> 
filterKeysFromFile(Path filePath, List> foundRecordKeys = new ArrayList<>();
 try (HoodieFileReader fileReader = 
HoodieFileReaderFactory.getReaderFactory(HoodieRecordType.AVRO)
-.getFileReader(configuration, filePath)) {
+.getFileReader(new HoodieConfig(), configuration, filePath)) {

Review Comment:
   If it's always an empty one, should we declare a singleton instance and 
reuse it wherever required? You can name it DEFAULT_HUDI_CONFIG_FOR_READER.



##
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieHFileDataBlock.java:
##
@@ -83,19 +89,29 @@ public HoodieHFileDataBlock(FSDataInputStream inputStream,
   Map header,
   Map footer,
   boolean enablePointLookups,
-  Path pathForReader) {
-super(content, inputStream, readBlockLazily, 
Option.of(logBlockContentLocation), readerSc

Re: [PR] [MINOR] Fix UT error in HUDI-6941 with stage task numbers [hudi]

2024-01-24 Thread via GitHub


hudi-bot commented on PR #10554:
URL: https://github.com/apache/hudi/pull/10554#issuecomment-1909446353

   
   ## CI report:
   
   * e6934024c687f7deb7942e0edb833818aa96b843 UNKNOWN
   * c3c58fa1feb8bf451e9d0d6cf7e074fe08010dbe Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22144)
 
   * 58eaae52e37c5be3354346cab9a4f22769ff8129 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22158)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [MINOR] Fix UT error in HUDI-6941 with stage task numbers [hudi]

2024-01-24 Thread via GitHub


hudi-bot commented on PR #10554:
URL: https://github.com/apache/hudi/pull/10554#issuecomment-1909439147

   
   ## CI report:
   
   * e6934024c687f7deb7942e0edb833818aa96b843 UNKNOWN
   * c3c58fa1feb8bf451e9d0d6cf7e074fe08010dbe Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22144)
 
   * 58eaae52e37c5be3354346cab9a4f22769ff8129 UNKNOWN
   
   



Re: [PR] [HUDI-7238] Bug fixes and optimization of ExternalSpillableMap [hudi]

2024-01-24 Thread via GitHub


hudi-bot commented on PR #10344:
URL: https://github.com/apache/hudi/pull/10344#issuecomment-1909432153

   
   ## CI report:
   
   * 9c2e36ff019825e1b3e208e7a8ae0d0252029ea3 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22155)
 
   
   



Re: [PR] [MINOR] Fix UT error in HUDI-6941 with stage task numbers [hudi]

2024-01-24 Thread via GitHub


xuzifu666 commented on code in PR #10554:
URL: https://github.com/apache/hudi/pull/10554#discussion_r1465887278


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestInsertTable.scala:
##
@@ -2160,6 +2174,8 @@ class TestInsertTable extends HoodieSparkSqlTestBase {
|union
|select '1' as id, 'aa' as name, 123 as dt, '2023-10-12' as `day`, 
12 as `hour`
|""".stripMargin)
+  val stageClassName = classOf[HoodieSparkEngineContext].getSimpleName
+  spark.sparkContext.addSparkListener(new 
StageParallelismListener(stageName = stageClassName))

Review Comment:
   @bvaradar OK, I have added a counter to assert that it is called at least once. PTAL.
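
A rough sketch of why such a counter helps: checks that live inside a listener callback pass vacuously if the callback never fires, so the test also asserts on the call count. `CountingListener` and `onStage` are hypothetical names for illustration only; this is plain Java and does not use the Spark listener API or the `StageParallelismListener` from the PR.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ListenerCounterSketch {
  // Hypothetical listener: counts how many times it was invoked.
  static class CountingListener {
    private final AtomicInteger calls = new AtomicInteger();

    void onStage(String stageName) {
      // Per-stage checks would go here; they only run if this is called.
      calls.incrementAndGet();
    }

    int callCount() {
      return calls.get();
    }
  }

  public static void main(String[] args) {
    CountingListener listener = new CountingListener();
    listener.onStage("stage-0");
    listener.onStage("stage-1");
    // Guard against the callback never having fired.
    if (listener.callCount() < 1) {
      throw new AssertionError("listener was never called");
    }
    System.out.println(listener.callCount());
  }
}
```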






[I] Issue with reading the debezium inputs [hudi]

2024-01-24 Thread via GitHub


zyperd opened a new issue, #10561:
URL: https://github.com/apache/hudi/issues/10561

   
   When Hudi reads the Debezium-ingested topics, the following error 
message is displayed. Kindly help to identify the issue.
   
   ```
   Caused by: java.lang.NoSuchMethodException: org.apache.hudi.utilities.sources.debezium.MysqlDebeziumSource.<init>(org.apache.hudi.common.config.TypedProperties,org.apache.spark.api.java.JavaSparkContext,org.apache.spark.sql.SparkSession,org.apache.hudi.utilities.schema.SchemaProvider)
   ```
   
   In the source, the constructor is declared as:
   
   ```
   public MysqlDebeziumSource(TypedProperties props, JavaSparkContext sparkContext,
                              SparkSession sparkSession,
                              SchemaProvider schemaProvider,
                              HoodieIngestionMetrics metrics)
   ```
   
   Is the spark-submit command missing any Hudi config?
   hudi-aws-bundle.jar -> hudi-utilities-bundle_2.12-0.14.0-amzn-1.jar
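
A minimal sketch of how a `NoSuchMethodException` on `<init>` can arise in general: a reflective lookup asks for a constructor whose parameter list does not exactly match any declared one. `DemoSource` and `hasConstructor` are hypothetical names, and this is not a diagnosis of the report above, only an illustration of the lookup rule in `Class.getConstructor`.

```java
import java.util.Properties;

public class ConstructorLookupSketch {
  // Hypothetical source class: only a two-argument constructor is declared.
  public static class DemoSource {
    public DemoSource(Properties props, String metrics) { }
  }

  // Returns true if a public constructor with exactly these parameter types exists.
  static boolean hasConstructor(Class<?> clazz, Class<?>... paramTypes) {
    try {
      clazz.getConstructor(paramTypes);
      return true;
    } catch (NoSuchMethodException e) {
      // Thrown when the requested signature is not declared on the class.
      return false;
    }
  }

  public static void main(String[] args) {
    // One-argument lookup fails: no such constructor is declared.
    System.out.println(hasConstructor(DemoSource.class, Properties.class));
    // Two-argument lookup matches the declared constructor.
    System.out.println(hasConstructor(DemoSource.class, Properties.class, String.class));
  }
}
```

The lookup is by exact parameter types, so a caller compiled against one signature and a class compiled with another will not match each other.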
   
   
   ```
   spark-submit \
     --master yarn \
     --deploy-mode cluster \
     --driver-memory 2g --executor-memory 1g --num-executors 1 --executor-cores 1 \
     --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
     --conf spark.sql.catalogImplementation=hive \
     --conf spark.driver.maxResultSize=1g \
     --conf spark.speculation=true \
     --conf spark.speculation.multiplier=1.0 \
     --conf spark.speculation.quantile=0.5 \
     --conf spark.ui.port=6680 \
     --conf spark.eventLog.dir=s3://spark_events/ \
     --conf spark.eventLog.enabled=true \
     --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
     --conf spark.scheduler.mode=FAIR \
     --jars /usr/lib/hudi/hudi-aws-bundle.jar,/home/hadoop/kafka-avro-serializer-3.1.1.jar \
     --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer /usr/lib/hudi/hudi-utilities-bundle.jar \
     --target-base-path s3://mysql_cdc/table_cdc/ \
     --source-class org.apache.hudi.utilities.sources.debezium.MysqlDebeziumSource \
     --payload-class org.apache.hudi.common.model.debezium.MySqlDebeziumAvroPayload \
     --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
     --source-ordering-field id \
     --target-table table_cdc \
     --table-type COPY_ON_WRITE \
     --op UPSERT \
     --enable-hive-sync \
     --sync-tool-classes org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool \
     --hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor \
     --hoodie-conf auto.offset.reset=earliest \
     --hoodie-conf bootstrap.servers=127.0.0.1:9002 \
     --hoodie-conf hoodie.deltastreamer.source.kafka.topic="table_cdc" \
     --hoodie-conf hoodie.deltastreamer.source.kafka.value.deserializer.class=org.apache.hudi.utilities.deser.KafkaAvroSchemaDeserializer \
     --hoodie-conf hoodie.datasource.hive_sync.enable=true \
     --hoodie-conf hoodie.datasource.hive_sync.database=default \
     --hoodie-conf hoodie.datasource.hive_sync.table=table_cdc \
     --hoodie-conf hoodie.datasource.hive_sync.use_jdbc=false \
     --hoodie-conf hoodie.datasource.write.recordkey.field=id \
     --hoodie-conf hoodie.datasource.write.partitionpath.field=value_type \
     --hoodie-conf hoodie.compaction.payload.class=org.apache.hudi.common.model.DebeziumAvroPayload \
     --hoodie-conf hoodie.table.name=table_cdc \
     --hoodie-conf hoodie.streamer.schemaprovider.source.schema.file=file:///source.avsc \
     --hoodie-conf hoodie.streamer.schemaprovider.target.schema.file=file:///target.avsc \
     --hoodie-conf hoodie.datasource.hive_sync.partition_fields=value_type \
     --hoodie-conf hoodie.datasource.write.hive_style_partitioning=false \
     --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://cdc-events/ \
     --hoodie-conf hoodie.datasource.hive_sync.mode=hms
   ```
   
   #source.avsc
   ```
   {
 "type": "record",
 "name": "ChangeEvent",
 "fields": [
   {
 "name": "before",
 "type": ["null", "string"]
   },
   {
 "name": "after",
 "type": {
   "type": "record",
   "name": "After",
   "fields": [
 { "name": "id", "type": ["int"] },
 { "name": "values", "type": "string" },
 { "name": "value_type", "type": "string" },
   ]
 }
   },
   {
 "name": "source",
 "type": {
   "type": "record",
   "name": "Source",
   "fields": [
 { "name": "version", "type": ["null", "string"] },
 { "name": "connector", "type": ["null", "string"] },
 { "name": "name", "type": ["null", "string"] },
 { "name": "ts_ms", "type": ["null", "long"] },
 { "name": "snapshot", "t

Re: [PR] [HUDI-7323] Use a schema supplier instead of a static value [hudi]

2024-01-24 Thread via GitHub


yihua merged PR #10549:
URL: https://github.com/apache/hudi/pull/10549





(hudi) branch master updated (11861c8a50e -> f2b24a149c1)

2024-01-24 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from 11861c8a50e [HUDI-7298] Write bad records to error table in more cases 
instead of failing stream (#10500)
 add f2b24a149c1 [HUDI-7323] Use a schema supplier instead of a static 
value (#10549)

No new revisions were added by this update.

Summary of changes:
 .../org/apache/hudi/utilities/UtilHelpers.java |  7 +++---
 .../apache/hudi/utilities/streamer/StreamSync.java | 15 +--
 .../utilities/transform/ChainedTransformer.java| 12 +
 .../ErrorTableAwareChainedTransformer.java |  5 ++--
 .../functional/TestChainedTransformer.java | 29 +++---
 .../TestErrorTableAwareChainedTransformer.java |  4 +--
 6 files changed, 48 insertions(+), 24 deletions(-)



Re: [I] [SUPPORT] After upgrading hudi 0.14.1, use Spark SQL merge into to update the matched_action, the case of the column name and the expression name does not match, resulting in an exception. [hu

2024-01-24 Thread via GitHub


yihao-tcf commented on issue #10558:
URL: https://github.com/apache/hudi/issues/10558#issuecomment-1909366472

   > @yihao-tcf @jonvex hi, any plan fix it? If not, I can try to fix it
   
   @KnightChess I don't have any plans here. Thank you for fixing this issue





Re: [PR] [MINOR] Allow removal of column stats from metadata table for externally created files [hudi]

2024-01-24 Thread via GitHub


nsivabalan commented on code in PR #10238:
URL: https://github.com/apache/hudi/pull/10238#discussion_r1465848613


##
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##
@@ -652,7 +653,7 @@ public static HoodieData 
convertMetadataToColumnStatsRecords(Hoodi
   String partitionPath = deleteFileInfoPair.getLeft();
   String filePath = deleteFileInfoPair.getRight();
 
-  if (filePath.endsWith(HoodieFileFormat.PARQUET.getFileExtension())) {
+  if (filePath.endsWith(HoodieFileFormat.PARQUET.getFileExtension()) 
|| ExternalFilePathUtil.isExternallyCreatedFile(filePath)) {

Review Comment:
   I guess there is a gap here with respect to log files. 
   For log files, we get stats directly from the append handle, and entries are 
added to col stats using the log file name. 
   So, when processing clean commit metadata, we should be deleting the entries 
for both data files and log files. 
   
   https://issues.apache.org/jira/browse/HUDI-7331
   
   I have created a follow-up ticket for this. Let's fix it thoroughly. 






[jira] [Created] (HUDI-7331) Test and certify col stats integration with MOR table

2024-01-24 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7331:
-

 Summary: Test and certify col stats integration with MOR table
 Key: HUDI-7331
 URL: https://issues.apache.org/jira/browse/HUDI-7331
 Project: Apache Hudi
  Issue Type: Bug
  Components: metadata
Reporter: sivabalan narayanan


Let's test and certify col stats integration with MOR tables for all operations.

For example, any write operation (bulk insert, insert, upsert, insert overwrite) 
should add new entries to the col stats index in the metadata table. 

rollback: 

Entries for files that were deleted (data files) should be removed from col stats. 

For log files added, we should add new entries to col stats. 

clean: 

Any files deleted (data files and log files) should have their entries removed 
from col stats in the MDT. 

Similarly, let's do a similar exercise with delete partition and the other 
operations we have in Hudi. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]

2024-01-24 Thread via GitHub


danny0405 commented on code in PR #9819:
URL: https://github.com/apache/hudi/pull/9819#discussion_r1465827024


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieBaseFileGroupRecordBuffer.java:
##
@@ -0,0 +1,279 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.common.table.read;
+
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.engine.HoodieReaderContext;
+import org.apache.hudi.common.model.DeleteRecord;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordMerger;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.log.KeySpec;
+import org.apache.hudi.common.table.log.block.HoodieDataBlock;
+import org.apache.hudi.common.table.log.block.HoodieLogBlock;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ReflectionUtils;
+import org.apache.hudi.common.util.collection.ClosableIterator;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieCorruptedDataException;
+import org.apache.hudi.exception.HoodieKeyException;
+import org.apache.hudi.exception.HoodieValidationException;
+
+import org.apache.avro.Schema;
+import org.roaringbitmap.longlong.Roaring64NavigableMap;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+
+public abstract class HoodieBaseFileGroupRecordBuffer implements 
HoodieFileGroupRecordBuffer {
+  protected final HoodieReaderContext readerContext;
+  protected final Schema readerSchema;
+  protected final Schema baseFileSchema;
+  protected final Option partitionNameOverrideOpt;
+  protected final Option partitionPathFieldOpt;
+  protected final HoodieRecordMerger recordMerger;
+  protected final TypedProperties payloadProps;
+  protected final HoodieTableMetaClient hoodieTableMetaClient;
+  protected final Map, Map>> records;
+  protected ClosableIterator baseFileIterator;
+  protected Iterator, Map>> logRecordIterator;
+  protected T nextRecord;
+
+  public HoodieBaseFileGroupRecordBuffer(HoodieReaderContext readerContext,
+ Schema readerSchema,
+ Schema baseFileSchema,
+ Option 
partitionNameOverrideOpt,
+ Option 
partitionPathFieldOpt,
+ HoodieRecordMerger recordMerger,
+ TypedProperties payloadProps,
+ HoodieTableMetaClient 
hoodieTableMetaClient) {
+this.readerContext = readerContext;
+this.readerSchema = readerSchema;
+this.baseFileSchema = baseFileSchema;
+this.partitionNameOverrideOpt = partitionNameOverrideOpt;
+this.partitionPathFieldOpt = partitionPathFieldOpt;
+this.recordMerger = recordMerger;
+this.payloadProps = payloadProps;
+this.hoodieTableMetaClient = hoodieTableMetaClient;
+this.records = new HashMap<>();

Review Comment:
   The sequence of the log records is lost by using `HashMap#values`; we 
should fix it. And in general, should we cache all the log records in memory? I 
don't think that is reasonable; we should use a spillable map here.
   
   We also need to support an unmerged log reader for streaming read 
scenarios. In that case, we should not actually buffer the log records. The 
log read sequence should be ensured too.
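
A small sketch of the ordering point only. It swaps in `java.util.LinkedHashMap`, which iterates in first-insertion order, as an illustration; this is an assumption made for the example, not the spillable map the comment proposes, and `OrderedBufferSketch` and `bufferAndRead` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class OrderedBufferSketch {
  // Buffers records by key and returns the buffered values.
  // LinkedHashMap keeps first-insertion order; HashMap gives no order guarantee.
  static List<String> bufferAndRead(List<String[]> keyedRecords) {
    Map<String, String> buffer = new LinkedHashMap<>();
    for (String[] kv : keyedRecords) {
      // A later record for the same key replaces the value but keeps the slot.
      buffer.put(kv[0], kv[1]);
    }
    return new ArrayList<>(buffer.values());
  }

  public static void main(String[] args) {
    List<String[]> records = Arrays.asList(
        new String[] {"k3", "c"},
        new String[] {"k1", "a"},
        new String[] {"k2", "b"},
        new String[] {"k3", "c2"});
    System.out.println(bufferAndRead(records)); // prints [c2, a, b]
  }
}
```

An in-memory ordered map still holds every record, so it addresses the sequence concern but not the memory concern raised in the same comment.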






[PR] new hudi content for 01-2024 [hudi]

2024-01-24 Thread via GitHub


nfarah86 opened a new pull request, #10560:
URL: https://github.com/apache/hudi/pull/10560

   New PR content for Hudi blogs. cc @bhasudha 
   
   https://github.com/apache/hudi/assets/5392555/00daee1b-bb2a-4850-b155-619a3c2a3383
   





Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]

2024-01-24 Thread via GitHub


danny0405 commented on code in PR #9819:
URL: https://github.com/apache/hudi/pull/9819#discussion_r1465827024


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieBaseFileGroupRecordBuffer.java:
##

Review Comment:
   The sequence of the log records is lost by using `HashMap#values`; we 
should fix it. And in general, should we cache all the log records in memory? I 
don't think that is reasonable. When the base file is empty, it is feasible to 
just keep an iterator over the log files.






Re: [PR] [HUDI-6497] WIP HoodieStorage abstraction [hudi]

2024-01-24 Thread via GitHub


hudi-bot commented on PR #10360:
URL: https://github.com/apache/hudi/pull/10360#issuecomment-1909332258

   
   ## CI report:
   
   * 0a958d6408a7d0107ae2dcfc2aae676fd1a6977d UNKNOWN
   * 6632d6e715eec0e54ae047f1d89c8f979ac8639d UNKNOWN
   * fd05a7d87c676275e5f5e329e0207cc97ec9adfb UNKNOWN
   * 74b8a6658f324313bec3525aae40a3203a8c6bc1 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22154)
 
   
   



[jira] [Closed] (HUDI-7215) Delete NewHoodieParquetFileFormat and all references

2024-01-24 Thread Jonathan Vexler (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Vexler closed HUDI-7215.
-
Resolution: Fixed

> Delete NewHoodieParquetFileFormat and all references
> 
>
> Key: HUDI-7215
> URL: https://issues.apache.org/jira/browse/HUDI-7215
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
>
> HoodieFileGroupReaderBasedParquetFileFormat now has feature parity with 
> NewHoodieParquetFileFormat and no new work will be done on 
> NewHoodieParquetFileFormat. 





[jira] [Closed] (HUDI-7244) Ensure ClosableIterator is propagated all the way to FileScanRDD

2024-01-24 Thread Jonathan Vexler (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Vexler closed HUDI-7244.
-
Resolution: Fixed

> Ensure ClosableIterator is propagated all the way to FileScanRDD
> 
>
> Key: HUDI-7244
> URL: https://issues.apache.org/jira/browse/HUDI-7244
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark, spark-sql
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Critical
>  Labels: pull-request-available
>
> CI tests are OOMing. One cause is that resources are not being freed from the 
> new filegroup reader. After some code inspection, it was found that close is 
> not being called in the HoodieFileGroupReaderIterator





[jira] [Updated] (HUDI-7045) Fix new file format and reader for schema evolution

2024-01-24 Thread Jonathan Vexler (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Vexler updated HUDI-7045:
--
Status: In Progress  (was: Open)

> Fix new file format and reader for schema evolution
> ---
>
> Key: HUDI-7045
> URL: https://issues.apache.org/jira/browse/HUDI-7045
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
>
> When this is implemented, parquet readers should not be created in 
> HoodieFileGroupReaderBasedParquetFileFormat. Additionally, we can 
> uncomment/add the code from this commit: 
> [https://github.com/apache/hudi/pull/10137/commits/b0b711e0c355320da652fa7f2d8669539873d4d6]





[jira] [Closed] (HUDI-7296) Reduce combinations for some tests to make ci faster

2024-01-24 Thread Jonathan Vexler (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Vexler closed HUDI-7296.
-
Resolution: Fixed

> Reduce combinations for some tests to make ci faster
> 
>
> Key: HUDI-7296
> URL: https://issues.apache.org/jira/browse/HUDI-7296
> Project: Apache Hudi
>  Issue Type: Test
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
>
> testBootstrapRead and TestHoodieDeltaStreamerSchemaEvolutionQuick have many 
> combinations of params. While it is good to test everything, there are lots 
> of code paths that have extensive duplicate testing. Reduce the number of 
> tests while still maintaining code coverage





[jira] [Updated] (HUDI-6787) Integrate FileGroupReader with HoodieMergeOnReadSnapshotReader and RealtimeCompactedRecordReader for Hive

2024-01-24 Thread Jonathan Vexler (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Vexler updated HUDI-6787:
--
Status: Patch Available  (was: In Progress)

> Integrate FileGroupReader with HoodieMergeOnReadSnapshotReader and 
> RealtimeCompactedRecordReader for Hive
> -
>
> Key: HUDI-6787
> URL: https://issues.apache.org/jira/browse/HUDI-6787
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Jonathan Vexler
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>






[jira] [Closed] (HUDI-6872) Simplify Out Of Box Schema Evolution Functionality

2024-01-24 Thread Jonathan Vexler (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Vexler closed HUDI-6872.
-
Resolution: Fixed

> Simplify Out Of Box Schema Evolution Functionality
> --
>
> Key: HUDI-6872
> URL: https://issues.apache.org/jira/browse/HUDI-6872
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: deltastreamer, spark, spark-sql
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
>
> Test schema evolution capabilities out of the box for deltastreamer and 
> datasource. Make schema evolution out of the box easy to understand and use





[jira] [Updated] (HUDI-7327) hoodie.write.handle.missing.cols.with.lossless.type.promotion does not work with HoodieIncrSource unless meta cols are dropped

2024-01-24 Thread Jonathan Vexler (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Vexler updated HUDI-7327:
--
Status: Patch Available  (was: In Progress)

> hoodie.write.handle.missing.cols.with.lossless.type.promotion does not work 
> with HoodieIncrSource unless meta cols are dropped
> --
>
> Key: HUDI-7327
> URL: https://issues.apache.org/jira/browse/HUDI-7327
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: deltastreamer
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
>
> The incoming meta cols are treated as new columns, which internalschema does 
> not allow, so it fails.





[jira] [Closed] (HUDI-7298) Write bad records to error table in more cases instead of failing stream

2024-01-24 Thread Jonathan Vexler (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Vexler closed HUDI-7298.
-
Resolution: Fixed

> Write bad records to error table in more cases instead of failing stream
> 
>
> Key: HUDI-7298
> URL: https://issues.apache.org/jira/browse/HUDI-7298
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: deltastreamer, spark
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Critical
>  Labels: pull-request-available
>
> If no transformer is used but a schema provider is used, records with an 
> incorrect schema will not be detected and will fail the stream during 
> HoodieRecord creation. Additionally, during key generation the stream can 
> crash if required fields are null.
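
A minimal sketch of the general idea, assuming records are plain maps and the record key is a single field. `ErrorRoutingSketch`, `recordKey` and `route` are hypothetical names for illustration; they are not the Hudi implementation or its API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: route records whose key cannot be generated to an
// error list instead of letting one bad record fail the whole batch.
final class ErrorRoutingSketch {

  // Holds the two outputs: keys for good records, and the bad records themselves.
  static final class Result {
    final List<String> keys = new ArrayList<>();
    final List<Map<String, Object>> errors = new ArrayList<>();
  }

  // Simplified key generation: fails when the key field is missing or null.
  static String recordKey(Map<String, Object> record, String keyField) {
    Object value = record.get(keyField);
    if (value == null) {
      throw new IllegalArgumentException("record key field '" + keyField + "' is null");
    }
    return value.toString();
  }

  // Attempts key generation per record; failures are collected, not rethrown.
  static Result route(List<Map<String, Object>> records, String keyField) {
    Result result = new Result();
    for (Map<String, Object> record : records) {
      try {
        result.keys.add(recordKey(record, keyField));
      } catch (RuntimeException e) {
        result.errors.add(record);
      }
    }
    return result;
  }
}
```

In this sketch a null key field sends the record to `errors` while the remaining records are still processed, which is the behaviour the description asks for in place of crashing the stream.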





[jira] [Updated] (HUDI-7327) hoodie.write.handle.missing.cols.with.lossless.type.promotion does not work with HoodieIncrSource unless meta cols are dropped

2024-01-24 Thread Jonathan Vexler (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Vexler updated HUDI-7327:
--
Status: In Progress  (was: Open)

> hoodie.write.handle.missing.cols.with.lossless.type.promotion does not work 
> with HoodieIncrSource unless meta cols are dropped
> --
>
> Key: HUDI-7327
> URL: https://issues.apache.org/jira/browse/HUDI-7327
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: deltastreamer
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
>
> The incoming meta cols are treated as new columns, which internalschema does 
> not allow, so it fails.





(hudi) branch master updated: [HUDI-7298] Write bad records to error table in more cases instead of failing stream (#10500)

2024-01-24 Thread codope
This is an automated email from the ASF dual-hosted git repository.

codope pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 11861c8a50e [HUDI-7298] Write bad records to error table in more cases 
instead of failing stream (#10500)
11861c8a50e is described below

commit 11861c8a50e7dd23186d44bdc7aef871e5fc1280
Author: Jon Vexler 
AuthorDate: Wed Jan 24 22:59:29 2024 -0500

[HUDI-7298] Write bad records to error table in more cases instead of 
failing stream (#10500)

Cases:
- No transformers, with schema provider. Records will go to the error table 
if they cannot be rewritten in the deduced schema.
- recordkey is null, even if the column is nullable in the schema
---
 .../apache/hudi/config/HoodieErrorTableConfig.java |   6 ++
 .../scala/org/apache/hudi/HoodieSparkUtils.scala   |  21 +
 .../java/org/apache/hudi/avro/HoodieAvroUtils.java |  33 ++-
 .../org/apache/hudi/TestHoodieSparkUtils.scala |   4 +
 .../apache/hudi/utilities/streamer/ErrorEvent.java |   6 +-
 .../utilities/streamer/HoodieStreamerUtils.java|  68 ++
 .../apache/hudi/utilities/streamer/StreamSync.java |  19 +++-
 ...TestHoodieDeltaStreamerSchemaEvolutionBase.java |  63 +
 ...oodieDeltaStreamerSchemaEvolutionExtensive.java | 100 +++--
 ...estHoodieDeltaStreamerSchemaEvolutionQuick.java |  18 ++--
 .../utilities/sources/TestGenericRddTransform.java |  29 ++
 .../schema-evolution/testMissingRecordKey.json |   2 +
 12 files changed, 334 insertions(+), 35 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieErrorTableConfig.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieErrorTableConfig.java
index 68e2097c33b..8ba013b00ee 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieErrorTableConfig.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieErrorTableConfig.java
@@ -72,6 +72,12 @@ public class HoodieErrorTableConfig {
   .defaultValue(false)
   .withDocumentation("Records with schema mismatch with Target Schema are 
sent to Error Table.");
 
+  public static final ConfigProperty 
ERROR_ENABLE_VALIDATE_RECORD_CREATION = ConfigProperty
+  .key("hoodie.errortable.validate.recordcreation.enable")
+  .defaultValue(true)
+  .sinceVersion("0.14.2")
+  .withDocumentation("Records that fail to be created due to keygeneration 
failure or other issues will be sent to the Error Table");
+
   public static final ConfigProperty 
ERROR_TABLE_WRITE_FAILURE_STRATEGY = ConfigProperty
   .key("hoodie.errortable.write.failure.strategy")
   .defaultValue(ErrorWriteFailureStrategy.ROLLBACK_COMMIT.name())
diff --git 
a/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieSparkUtils.scala
 
b/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieSparkUtils.scala
index 527864fcf24..535af8db193 100644
--- 
a/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieSparkUtils.scala
+++ 
b/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieSparkUtils.scala
@@ -199,6 +199,27 @@ object HoodieSparkUtils extends SparkAdapterSupport with 
SparkVersionsSupport wi
 }
   }
 
+  /**
+   * Rerwite the record into the target schema.
+   * Return tuple of rewritten records and records that could not be converted
+   */
+  def safeRewriteRDD(df: RDD[GenericRecord], serializedTargetSchema: String): 
Tuple2[RDD[GenericRecord], RDD[String]] = {
+val rdds: RDD[Either[GenericRecord, String]] = df.mapPartitions { recs =>
+  if (recs.isEmpty) {
+Iterator.empty
+  } else {
+val schema = new Schema.Parser().parse(serializedTargetSchema)
+val transform: GenericRecord => Either[GenericRecord, String] = record 
=> try {
+  Left(HoodieAvroUtils.rewriteRecordDeep(record, schema, true))
+} catch {
+  case _: Throwable => Right(HoodieAvroUtils.avroToJsonString(record, 
false))
+}
+recs.map(transform)
+  }
+}
+(rdds.filter(_.isLeft).map(_.left.get), 
rdds.filter(_.isRight).map(_.right.get))
+  }
+
   def getCatalystRowSerDe(structType: StructType): SparkRowSerDe = {
 sparkAdapter.createSparkRowSerDe(structType)
   }
diff --git 
a/hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java 
b/hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
index ac7dcd42979..9b925eb59be 100644
--- a/hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
+++ b/hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
@@ -187,6 +187,16 @@ public class HoodieAvroUtils {
 }
   }
 
+  /**
+   * Convert a given avro record to json and return the string
+   *
+   * @param record The GenericRecord to convert
+   * @param pretty Whether to

Re: [PR] [HUDI-7298] Write bad records to error table in more cases instead of failing stream [hudi]

2024-01-24 Thread via GitHub


codope merged PR #10500:
URL: https://github.com/apache/hudi/pull/10500





Re: [PR] [HUDI-7238] Bug fixes and optimization of ExternalSpillableMap [hudi]

2024-01-24 Thread via GitHub


hudi-bot commented on PR #10344:
URL: https://github.com/apache/hudi/pull/10344#issuecomment-1909294966

   
   ## CI report:
   
   * f0d32bea4e960cd85b8e344597ec4f006c213b44 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22145)
 
   * 9c2e36ff019825e1b3e208e7a8ae0d0252029ea3 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22155)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6497] WIP HoodieStorage abstraction [hudi]

2024-01-24 Thread via GitHub


hudi-bot commented on PR #10360:
URL: https://github.com/apache/hudi/pull/10360#issuecomment-1909290294

   
   ## CI report:
   
   * 0a958d6408a7d0107ae2dcfc2aae676fd1a6977d UNKNOWN
   * 6632d6e715eec0e54ae047f1d89c8f979ac8639d UNKNOWN
   * 7476839c8fde914ff1e201af11f591f46fec392e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22153)
 
   * fd05a7d87c676275e5f5e329e0207cc97ec9adfb UNKNOWN
   * 74b8a6658f324313bec3525aae40a3203a8c6bc1 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22154)
 
   
   



Re: [PR] [HUDI-7238] Bug fixes and optimization of ExternalSpillableMap [hudi]

2024-01-24 Thread via GitHub


hudi-bot commented on PR #10344:
URL: https://github.com/apache/hudi/pull/10344#issuecomment-1909290227

   
   ## CI report:
   
   * f0d32bea4e960cd85b8e344597ec4f006c213b44 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22145)
 
   * 9c2e36ff019825e1b3e208e7a8ae0d0252029ea3 UNKNOWN
   
   



Re: [PR] [HUDI-6497] WIP HoodieStorage abstraction [hudi]

2024-01-24 Thread via GitHub


hudi-bot commented on PR #10360:
URL: https://github.com/apache/hudi/pull/10360#issuecomment-1909284023

   
   ## CI report:
   
   * 0a958d6408a7d0107ae2dcfc2aae676fd1a6977d UNKNOWN
   * 6632d6e715eec0e54ae047f1d89c8f979ac8639d UNKNOWN
   * 7476839c8fde914ff1e201af11f591f46fec392e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22153)
 
   * fd05a7d87c676275e5f5e329e0207cc97ec9adfb UNKNOWN
   * 74b8a6658f324313bec3525aae40a3203a8c6bc1 UNKNOWN
   
   



Re: [PR] [] CVE-2023-44487 Upgrade jetty and exclude older jetty [hudi]

2024-01-24 Thread via GitHub


hudi-bot commented on PR #10223:
URL: https://github.com/apache/hudi/pull/10223#issuecomment-1909283810

   
   ## CI report:
   
   * d197ce8180f3f11e30d2254733c46f137e12376c Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22152)
 
   
   



Re: [PR] [HUDI-7238] Bug fixes and optimization of ExternalSpillableMap [hudi]

2024-01-24 Thread via GitHub


danny0405 commented on code in PR #10344:
URL: https://github.com/apache/hudi/pull/10344#discussion_r1465787117


##
hudi-common/src/main/java/org/apache/hudi/common/util/collection/ExternalSpillableMap.java:
##
@@ -78,41 +78,49 @@ public class ExternalSpillableMap keySizeEstimator,
+  public ExternalSpillableMap(long maxInMemorySizeInBytes, String 
baseFilePath, SizeEstimator keySizeEstimator,
   SizeEstimator valueSizeEstimator) throws 
IOException {
 this(maxInMemorySizeInBytes, baseFilePath, keySizeEstimator, 
valueSizeEstimator, DiskMapType.BITCASK);
   }
 
-  public ExternalSpillableMap(Long maxInMemorySizeInBytes, String 
baseFilePath, SizeEstimator keySizeEstimator,
+  public ExternalSpillableMap(long maxInMemorySizeInBytes, String 
baseFilePath, SizeEstimator keySizeEstimator,
   SizeEstimator valueSizeEstimator, DiskMapType 
diskMapType) throws IOException {
 this(maxInMemorySizeInBytes, baseFilePath, keySizeEstimator, 
valueSizeEstimator, diskMapType, false);
   }
 
-  public ExternalSpillableMap(Long maxInMemorySizeInBytes, String 
baseFilePath, SizeEstimator keySizeEstimator,
+  public ExternalSpillableMap(long maxInMemorySizeInBytes, String 
baseFilePath, SizeEstimator keySizeEstimator,
   SizeEstimator valueSizeEstimator, DiskMapType 
diskMapType, boolean isCompressionEnabled) throws IOException {
 this.inMemoryMap = new HashMap<>();
 this.baseFilePath = baseFilePath;
-this.maxInMemorySizeInBytes = (long) Math.floor(maxInMemorySizeInBytes * 
sizingFactorForInMemoryMap);
+this.maxInMemorySizeInBytes = (long) Math.floor(maxInMemorySizeInBytes * 
SIZING_FACTOR_FOR_IN_MEMORY_MAP);
 this.currentInMemoryMapSize = 0L;
 this.keySizeEstimator = keySizeEstimator;
 this.valueSizeEstimator = valueSizeEstimator;
 this.diskMapType = diskMapType;
 this.isCompressionEnabled = isCompressionEnabled;
   }
 
+  private DiskMap getDiskBasedMap() {
+return getDiskBasedMap(false);
+  }
+
+  private DiskMap getOrCreateDiskBasedMap() {
+return getDiskBasedMap(true);
+  }
+
   private DiskMap getDiskBasedMap(boolean forceInitialization) {
 if (null == diskBasedMap) {
-  if (!forceInitialization) {
-return DiskMap.empty();
-  }
   synchronized (this) {
 if (null == diskBasedMap) {
+  if (!forceInitialization) {
+return DiskMap.empty();

Review Comment:
   > We can avoid the dummy empty map by also embedding null checks into all of 
the methods
   
   That makes sense. I just thought it might be more straightforward to do 
that in the specific map implementations.






Re: [PR] [HUDI-7238] Bug fixes and optimization of ExternalSpillableMap [hudi]

2024-01-24 Thread via GitHub


the-other-tim-brown commented on code in PR #10344:
URL: https://github.com/apache/hudi/pull/10344#discussion_r1465785310


##
hudi-common/src/main/java/org/apache/hudi/common/util/collection/ExternalSpillableMap.java:
##
@@ -78,41 +78,49 @@ public class ExternalSpillableMap keySizeEstimator,
+  public ExternalSpillableMap(long maxInMemorySizeInBytes, String 
baseFilePath, SizeEstimator keySizeEstimator,
   SizeEstimator valueSizeEstimator) throws 
IOException {
 this(maxInMemorySizeInBytes, baseFilePath, keySizeEstimator, 
valueSizeEstimator, DiskMapType.BITCASK);
   }
 
-  public ExternalSpillableMap(Long maxInMemorySizeInBytes, String 
baseFilePath, SizeEstimator keySizeEstimator,
+  public ExternalSpillableMap(long maxInMemorySizeInBytes, String 
baseFilePath, SizeEstimator keySizeEstimator,
   SizeEstimator valueSizeEstimator, DiskMapType 
diskMapType) throws IOException {
 this(maxInMemorySizeInBytes, baseFilePath, keySizeEstimator, 
valueSizeEstimator, diskMapType, false);
   }
 
-  public ExternalSpillableMap(Long maxInMemorySizeInBytes, String 
baseFilePath, SizeEstimator keySizeEstimator,
+  public ExternalSpillableMap(long maxInMemorySizeInBytes, String 
baseFilePath, SizeEstimator keySizeEstimator,
   SizeEstimator valueSizeEstimator, DiskMapType 
diskMapType, boolean isCompressionEnabled) throws IOException {
 this.inMemoryMap = new HashMap<>();
 this.baseFilePath = baseFilePath;
-this.maxInMemorySizeInBytes = (long) Math.floor(maxInMemorySizeInBytes * 
sizingFactorForInMemoryMap);
+this.maxInMemorySizeInBytes = (long) Math.floor(maxInMemorySizeInBytes * 
SIZING_FACTOR_FOR_IN_MEMORY_MAP);
 this.currentInMemoryMapSize = 0L;
 this.keySizeEstimator = keySizeEstimator;
 this.valueSizeEstimator = valueSizeEstimator;
 this.diskMapType = diskMapType;
 this.isCompressionEnabled = isCompressionEnabled;
   }
 
+  private DiskMap getDiskBasedMap() {
+return getDiskBasedMap(false);
+  }
+
+  private DiskMap getOrCreateDiskBasedMap() {
+return getDiskBasedMap(true);
+  }
+
   private DiskMap getDiskBasedMap(boolean forceInitialization) {
 if (null == diskBasedMap) {
-  if (!forceInitialization) {
-return DiskMap.empty();
-  }
   synchronized (this) {
 if (null == diskBasedMap) {
+  if (!forceInitialization) {
+return DiskMap.empty();

Review Comment:
   At some point you will need to pay special attention to whether the 
read/write methods are correct. Right now we are just debating about where that 
is, is that correct?
   
   In my opinion, the ExternalSpillableMap is the logical place for handling 
the logic of initializing the disk map if it is needed. We can avoid the dummy 
empty map by also embedding null checks into all of the methods.
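
A rough sketch of the null-check approach described above, under simplifying assumptions: `SpillableMapSketch` is a hypothetical stand-in (not the real `ExternalSpillableMap`), a second `HashMap` stands in for the disk map, and the spill threshold is an entry count rather than an estimated byte size.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for a spillable map; not the Hudi class.
// The in-memory map is not thread-safe; only disk map creation is guarded.
final class SpillableMapSketch<K, V> {
  private final Map<K, V> inMemoryMap = new HashMap<>();
  private final int maxInMemoryEntries;
  // Stays null until the first spill; volatile so double-checked locking is safe.
  private volatile Map<K, V> diskBasedMap;

  SpillableMapSketch(int maxInMemoryEntries) {
    this.maxInMemoryEntries = maxInMemoryEntries;
  }

  // Write path only: creates the disk map on first use.
  private Map<K, V> getOrCreateDiskBasedMap() {
    Map<K, V> local = diskBasedMap;
    if (local == null) {
      synchronized (this) {
        local = diskBasedMap;
        if (local == null) {
          local = new HashMap<>(); // stands in for a real on-disk map
          diskBasedMap = local;
        }
      }
    }
    return local;
  }

  V put(K key, V value) {
    Map<K, V> disk = diskBasedMap;
    if (disk != null && disk.containsKey(key)) {
      return disk.put(key, value); // key already spilled: update it in place
    }
    if (inMemoryMap.containsKey(key) || inMemoryMap.size() < maxInMemoryEntries) {
      return inMemoryMap.put(key, value);
    }
    return getOrCreateDiskBasedMap().put(key, value);
  }

  // Read paths never create the disk map; a null check replaces the dummy empty map.
  V get(Object key) {
    V value = inMemoryMap.get(key);
    if (value != null) {
      return value;
    }
    Map<K, V> disk = diskBasedMap;
    return disk == null ? null : disk.get(key);
  }

  int size() {
    Map<K, V> disk = diskBasedMap;
    return inMemoryMap.size() + (disk == null ? 0 : disk.size());
  }

  boolean hasSpilled() {
    return diskBasedMap != null;
  }
}
```

The trade-off is the one under discussion: every read method carries its own null check, but no placeholder map is needed and the disk map is only ever created by a write.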






Re: [PR] [HUDI-7238] Bug fixes and optimization of ExternalSpillableMap [hudi]

2024-01-24 Thread via GitHub


danny0405 commented on code in PR #10344:
URL: https://github.com/apache/hudi/pull/10344#discussion_r1465781096


##
hudi-common/src/main/java/org/apache/hudi/common/util/collection/ExternalSpillableMap.java:
##
@@ -78,41 +78,49 @@ public class ExternalSpillableMap keySizeEstimator,
+  public ExternalSpillableMap(long maxInMemorySizeInBytes, String 
baseFilePath, SizeEstimator keySizeEstimator,
   SizeEstimator valueSizeEstimator) throws 
IOException {
 this(maxInMemorySizeInBytes, baseFilePath, keySizeEstimator, 
valueSizeEstimator, DiskMapType.BITCASK);
   }
 
-  public ExternalSpillableMap(Long maxInMemorySizeInBytes, String 
baseFilePath, SizeEstimator keySizeEstimator,
+  public ExternalSpillableMap(long maxInMemorySizeInBytes, String 
baseFilePath, SizeEstimator keySizeEstimator,
   SizeEstimator valueSizeEstimator, DiskMapType 
diskMapType) throws IOException {
 this(maxInMemorySizeInBytes, baseFilePath, keySizeEstimator, 
valueSizeEstimator, diskMapType, false);
   }
 
-  public ExternalSpillableMap(Long maxInMemorySizeInBytes, String 
baseFilePath, SizeEstimator keySizeEstimator,
+  public ExternalSpillableMap(long maxInMemorySizeInBytes, String 
baseFilePath, SizeEstimator keySizeEstimator,
   SizeEstimator valueSizeEstimator, DiskMapType 
diskMapType, boolean isCompressionEnabled) throws IOException {
 this.inMemoryMap = new HashMap<>();
 this.baseFilePath = baseFilePath;
-this.maxInMemorySizeInBytes = (long) Math.floor(maxInMemorySizeInBytes * 
sizingFactorForInMemoryMap);
+this.maxInMemorySizeInBytes = (long) Math.floor(maxInMemorySizeInBytes * 
SIZING_FACTOR_FOR_IN_MEMORY_MAP);
 this.currentInMemoryMapSize = 0L;
 this.keySizeEstimator = keySizeEstimator;
 this.valueSizeEstimator = valueSizeEstimator;
 this.diskMapType = diskMapType;
 this.isCompressionEnabled = isCompressionEnabled;
   }
 
+  private DiskMap getDiskBasedMap() {
+return getDiskBasedMap(false);
+  }
+
+  private DiskMap getOrCreateDiskBasedMap() {
+return getDiskBasedMap(true);
+  }
+
   private DiskMap getDiskBasedMap(boolean forceInitialization) {
 if (null == diskBasedMap) {
-  if (!forceInitialization) {
-return DiskMap.empty();
-  }
   synchronized (this) {
 if (null == diskBasedMap) {
+  if (!forceInitialization) {
+return DiskMap.empty();

Review Comment:
   I'm wondering if we can make the two map implementations initialize lazily by 
themselves, so that this ExternalSpillableMap does not need to pay special 
attention to keeping the read/write methods correct and there is no need to 
introduce the dummy empty map.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7298] Write bad records to error table in more cases instead of failing stream [hudi]

2024-01-24 Thread via GitHub


jonvex commented on PR #10500:
URL: https://github.com/apache/hudi/pull/10500#issuecomment-1909253547

   Azure CI is passing.





Re: [PR] [HUDI-7218] Integrate new HFile reader with file reader factory [hudi]

2024-01-24 Thread via GitHub


vinothchandar commented on code in PR #10330:
URL: https://github.com/apache/hudi/pull/10330#discussion_r1459536068


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java:
##
@@ -182,7 +183,7 @@ public static List> 
filterKeysFromFile(Path filePath, List> foundRecordKeys = new ArrayList<>();
 try (HoodieFileReader fileReader = 
HoodieFileReaderFactory.getReaderFactory(HoodieRecordType.AVRO)
-.getFileReader(configuration, filePath)) {
+.getFileReader(new HoodieConfig(), configuration, filePath)) {

Review Comment:
   It feels a little odd to pass in an empty properties list.






Re: [PR] [HUDI-6497] WIP HoodieStorage abstraction [hudi]

2024-01-24 Thread via GitHub


hudi-bot commented on PR #10360:
URL: https://github.com/apache/hudi/pull/10360#issuecomment-1909247856

   
   ## CI report:
   
   * 0a958d6408a7d0107ae2dcfc2aae676fd1a6977d UNKNOWN
   * 6632d6e715eec0e54ae047f1d89c8f979ac8639d UNKNOWN
   * 7476839c8fde914ff1e201af11f591f46fec392e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22153)
 
   * fd05a7d87c676275e5f5e329e0207cc97ec9adfb UNKNOWN
   
   



Re: [PR] [HUDI-7327] remove meta cols from incoming schema in stream sync [hudi]

2024-01-24 Thread via GitHub


hudi-bot commented on PR #10556:
URL: https://github.com/apache/hudi/pull/10556#issuecomment-1909242727

   
   ## CI report:
   
   * 6fb0ee2e5f0edcdf7657269973eb0968d0d7b0fa Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22151)
 
   
   



Re: [PR] [HUDI-6497] WIP HoodieStorage abstraction [hudi]

2024-01-24 Thread via GitHub


hudi-bot commented on PR #10360:
URL: https://github.com/apache/hudi/pull/10360#issuecomment-1909198297

   
   ## CI report:
   
   * 0a958d6408a7d0107ae2dcfc2aae676fd1a6977d UNKNOWN
   * 6632d6e715eec0e54ae047f1d89c8f979ac8639d UNKNOWN
   * 7476839c8fde914ff1e201af11f591f46fec392e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22153)
 
   
   



Re: [PR] [HUDI-6497] WIP HoodieStorage abstraction [hudi]

2024-01-24 Thread via GitHub


hudi-bot commented on PR #10360:
URL: https://github.com/apache/hudi/pull/10360#issuecomment-1909164123

   
   ## CI report:
   
   * 0a958d6408a7d0107ae2dcfc2aae676fd1a6977d UNKNOWN
   * 6632d6e715eec0e54ae047f1d89c8f979ac8639d UNKNOWN
   * 21bd59426ea0b1c4f3ecb9dd7fda124d9e3b3522 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22149)
 
   * 7476839c8fde914ff1e201af11f591f46fec392e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22153)
 
   
   



Re: [PR] [] CVE-2023-44487 Upgrade jetty and exclude older jetty [hudi]

2024-01-24 Thread via GitHub


hudi-bot commented on PR #10223:
URL: https://github.com/apache/hudi/pull/10223#issuecomment-1909157450

   
   ## CI report:
   
   * 157fb0e8df7b87579ca64a2d3a64212675baf644 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21923)
 
   * d197ce8180f3f11e30d2254733c46f137e12376c Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22152)
 
   
   



Re: [PR] [HUDI-7323] Use a schema supplier instead of a static value [hudi]

2024-01-24 Thread via GitHub


hudi-bot commented on PR #10549:
URL: https://github.com/apache/hudi/pull/10549#issuecomment-1909151475

   
   ## CI report:
   
   * ca627db36503a81c4223edde799bd344b9cf2b05 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22148)
 
   
   



Re: [PR] [] CVE-2023-44487 Upgrade jetty and exclude older jetty [hudi]

2024-01-24 Thread via GitHub


hudi-bot commented on PR #10223:
URL: https://github.com/apache/hudi/pull/10223#issuecomment-1909150913

   
   ## CI report:
   
   * 157fb0e8df7b87579ca64a2d3a64212675baf644 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21923)
 
   * d197ce8180f3f11e30d2254733c46f137e12376c UNKNOWN
   
   



Re: [PR] [HUDI-6497] WIP HoodieStorage abstraction [hudi]

2024-01-24 Thread via GitHub


hudi-bot commented on PR #10360:
URL: https://github.com/apache/hudi/pull/10360#issuecomment-1909143577

   
   ## CI report:
   
   * 0a958d6408a7d0107ae2dcfc2aae676fd1a6977d UNKNOWN
   * 6632d6e715eec0e54ae047f1d89c8f979ac8639d UNKNOWN
   * 21bd59426ea0b1c4f3ecb9dd7fda124d9e3b3522 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22149)
 
   * 7476839c8fde914ff1e201af11f591f46fec392e UNKNOWN
   
   



Re: [I] [SUPPORT] Hudi DeltaStreamer with Flattening Transformer [hudi]

2024-01-24 Thread via GitHub


soumilshah1995 closed issue #10499: [SUPPORT] Hudi DeltaStreamer with 
Flattening Transformer
URL: https://github.com/apache/hudi/issues/10499





Re: [I] [SUPPORT] Hudi DeltaStreamer with Flattening Transformer [hudi]

2024-01-24 Thread via GitHub


soumilshah1995 commented on issue #10499:
URL: https://github.com/apache/hudi/issues/10499#issuecomment-1909122022

   I need some time to experiment with the flattening transformer and to set 
   up a test project to see whether it works. 
   Let me close this for now and reopen it later, as I will most likely run 
   these tests next week. 





Re: [PR] [HUDI-7327] remove meta cols from incoming schema in stream sync [hudi]

2024-01-24 Thread via GitHub


hudi-bot commented on PR #10556:
URL: https://github.com/apache/hudi/pull/10556#issuecomment-1909106935

   
   ## CI report:
   
   * fd66a2b6c21e32cc340e3a813acf826dc83b3547 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22131)
 
   * 6fb0ee2e5f0edcdf7657269973eb0968d0d7b0fa Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22151)
 
   
   



Re: [PR] [HUDI-7327] remove meta cols from incoming schema in stream sync [hudi]

2024-01-24 Thread via GitHub


hudi-bot commented on PR #10556:
URL: https://github.com/apache/hudi/pull/10556#issuecomment-1909099807

   
   ## CI report:
   
   * fd66a2b6c21e32cc340e3a813acf826dc83b3547 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22131)
 
   * 6fb0ee2e5f0edcdf7657269973eb0968d0d7b0fa UNKNOWN
   
   



Re: [PR] [HUDI-6497] WIP HoodieStorage abstraction [hudi]

2024-01-24 Thread via GitHub


hudi-bot commented on PR #10360:
URL: https://github.com/apache/hudi/pull/10360#issuecomment-1909089865

   
   ## CI report:
   
   * 0a958d6408a7d0107ae2dcfc2aae676fd1a6977d UNKNOWN
   * 6632d6e715eec0e54ae047f1d89c8f979ac8639d UNKNOWN
   * 21bd59426ea0b1c4f3ecb9dd7fda124d9e3b3522 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22149)
 
   
   



(hudi) branch master updated (77833cdb096 -> a83f7c03836)

2024-01-24 Thread vbalaji
This is an automated email from the ASF dual-hosted git repository.

vbalaji pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from 77833cdb096 [HUDI-7311] Add implicit literal type conversion before 
filter push down (#10531)
 add a83f7c03836 [HUDI-7228] Fix eager closure of log reader input streams 
with log record reader (#10340)

No new revisions were added by this update.

Summary of changes:
 .../utils/LegacyArchivedMetaEntryReader.java   |  7 +++
 .../hudi/common/table/log/HoodieLogFileReader.java | 52 +-
 .../common/table/log/HoodieLogFormatReader.java| 32 +++--
 .../table/log/block/HoodieAvroDataBlock.java   |  5 ++-
 .../common/table/log/block/HoodieCDCDataBlock.java |  5 ++-
 .../common/table/log/block/HoodieCommandBlock.java |  5 ++-
 .../common/table/log/block/HoodieCorruptBlock.java |  5 ++-
 .../common/table/log/block/HoodieDataBlock.java|  5 ++-
 .../common/table/log/block/HoodieDeleteBlock.java  |  9 ++--
 .../table/log/block/HoodieHFileDataBlock.java  |  5 ++-
 .../common/table/log/block/HoodieLogBlock.java | 11 ++---
 .../table/log/block/HoodieParquetDataBlock.java|  5 ++-
 12 files changed, 65 insertions(+), 81 deletions(-)



Re: [PR] [HUDI-7228] Fix eager closure of log reader input streams with log record reader [hudi]

2024-01-24 Thread via GitHub


bvaradar merged PR #10340:
URL: https://github.com/apache/hudi/pull/10340





Re: [PR] [HUDI-6497] WIP HoodieStorage abstraction [hudi]

2024-01-24 Thread via GitHub


hudi-bot commented on PR #10360:
URL: https://github.com/apache/hudi/pull/10360#issuecomment-1909026446

   
   ## CI report:
   
   * 0a958d6408a7d0107ae2dcfc2aae676fd1a6977d UNKNOWN
   * 6632d6e715eec0e54ae047f1d89c8f979ac8639d UNKNOWN
   * b78aacdea8818d79256550e0ca2f2bb32708811e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22135)
 
   * 21bd59426ea0b1c4f3ecb9dd7fda124d9e3b3522 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22149)
 
   
   



Re: [PR] [HUDI-7327] remove meta cols from incoming schema in stream sync [hudi]

2024-01-24 Thread via GitHub


jonvex commented on code in PR #10556:
URL: https://github.com/apache/hudi/pull/10556#discussion_r1465590724


##
hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamerTestBase.java:
##
@@ -609,6 +610,7 @@ static HoodieDeltaStreamer.Config makeConfigForHudiIncrSrc(String srcBasePath, S
 cfg.schemaProviderClassName = schemaProviderClassName;
   }
   List<String> cfgs = new ArrayList<>();
+  cfgs.add(HANDLE_MISSING_COLUMNS_WITH_LOSSLESS_TYPE_PROMOTIONS.key() + "=true");

Review Comment:
   Yes. Without the change at stream sync line 664, it fails.






Re: [PR] [HUDI-7323] Use a schema supplier instead of a static value [hudi]

2024-01-24 Thread via GitHub


hudi-bot commented on PR #10549:
URL: https://github.com/apache/hudi/pull/10549#issuecomment-1908955641

   
   ## CI report:
   
   * ee8ed782107e9ef4aa7ebe50fa22fc68c6c14602 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22104)
 
   * ca627db36503a81c4223edde799bd344b9cf2b05 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22148)
 
   
   



Re: [PR] [HUDI-6497] WIP HoodieStorage abstraction [hudi]

2024-01-24 Thread via GitHub


hudi-bot commented on PR #10360:
URL: https://github.com/apache/hudi/pull/10360#issuecomment-1908955169

   
   ## CI report:
   
   * 0a958d6408a7d0107ae2dcfc2aae676fd1a6977d UNKNOWN
   * 6632d6e715eec0e54ae047f1d89c8f979ac8639d UNKNOWN
   * b78aacdea8818d79256550e0ca2f2bb32708811e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22135)
 
   * 21bd59426ea0b1c4f3ecb9dd7fda124d9e3b3522 UNKNOWN
   
   



Re: [PR] [HUDI-7323] Use a schema supplier instead of a static value [hudi]

2024-01-24 Thread via GitHub


hudi-bot commented on PR #10549:
URL: https://github.com/apache/hudi/pull/10549#issuecomment-1908945912

   
   ## CI report:
   
   * ee8ed782107e9ef4aa7ebe50fa22fc68c6c14602 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22104)
 
   * ca627db36503a81c4223edde799bd344b9cf2b05 UNKNOWN
   
   



Re: [PR] [HUDI-7323] Use a schema supplier instead of a static value [hudi]

2024-01-24 Thread via GitHub


the-other-tim-brown commented on code in PR #10549:
URL: https://github.com/apache/hudi/pull/10549#discussion_r1465518543


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/transform/ChainedTransformer.java:
##
@@ -120,6 +121,7 @@ private void validateIdentifier(String id, Set 
identifiers, String confi
 
   private StructType getExpectedTransformedSchema(TransformerInfo 
transformerInfo, JavaSparkContext jsc, SparkSession sparkSession,
   Option 
incomingStructOpt, Option> rowDatasetOpt, TypedProperties 
properties) {
+Option sourceSchemaOpt = sourceSchemaSupplier.get();

Review Comment:
   added a test to validate that the supplier is called per invocation of the 
method






Re: [PR] [HUDI-7228] Fix eager closure of log reader input streams with log record reader [hudi]

2024-01-24 Thread via GitHub


hudi-bot commented on PR #10340:
URL: https://github.com/apache/hudi/pull/10340#issuecomment-1908864979

   
   ## CI report:
   
   * f401ab103abf2eb6e2827a98fa8627795642f064 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22147)
 
   
   



Re: [PR] [HUDI-7323] Use a schema supplier instead of a static value [hudi]

2024-01-24 Thread via GitHub


yihua commented on code in PR #10549:
URL: https://github.com/apache/hudi/pull/10549#discussion_r1465442627


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/transform/ChainedTransformer.java:
##
@@ -120,6 +121,7 @@ private void validateIdentifier(String id, Set 
identifiers, String confi
 
   private StructType getExpectedTransformedSchema(TransformerInfo 
transformerInfo, JavaSparkContext jsc, SparkSession sparkSession,
   Option 
incomingStructOpt, Option> rowDatasetOpt, TypedProperties 
properties) {
+Option sourceSchemaOpt = sourceSchemaSupplier.get();

Review Comment:
   Could you add a test of a scenario where the schema is evolved from the 
schema provider, and the `getExpectedTransformedSchema` returns the updated 
`StructType` instance?
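
   The requested scenario can be sketched without Spark. This is a minimal 
illustration, not the code from the PR: `SchemaSupplierSketch`, 
`expectedSchema`, and the string-valued schema are hypothetical stand-ins for 
`ChainedTransformer`, `getExpectedTransformedSchema`, and `StructType`. It 
assumes the only point being tested is that the supplier is consulted on every 
call, so a schema that evolves between calls is picked up.

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical stand-in for a transformer that resolves the source schema lazily.
final class SchemaSupplierSketch {
  private final Supplier<Optional<String>> sourceSchemaSupplier;

  SchemaSupplierSketch(Supplier<Optional<String>> sourceSchemaSupplier) {
    this.sourceSchemaSupplier = sourceSchemaSupplier;
  }

  // Reads the supplier on each invocation instead of caching a value at construction time.
  String expectedSchema() {
    return sourceSchemaSupplier.get().orElse("<no schema>");
  }

  // Returns {first result, second result, number of supplier calls}.
  static String[] demo() {
    AtomicReference<String> providerSchema = new AtomicReference<>("id:int");
    AtomicInteger calls = new AtomicInteger();
    SchemaSupplierSketch sketch = new SchemaSupplierSketch(() -> {
      calls.incrementAndGet();
      return Optional.ofNullable(providerSchema.get());
    });
    String before = sketch.expectedSchema();
    providerSchema.set("id:int,name:string"); // the provider's schema evolves
    String after = sketch.expectedSchema();
    return new String[] {before, after, String.valueOf(calls.get())};
  }

  public static void main(String[] args) {
    System.out.println(String.join(" | ", demo()));
  }
}
```

   A value captured once at construction time would return the first schema on 
both calls; the supplier variant returns the evolved one on the second call.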






Re: [PR] [HUDI-7298] Write bad records to error table in more cases instead of failing stream [hudi]

2024-01-24 Thread via GitHub


hudi-bot commented on PR #10500:
URL: https://github.com/apache/hudi/pull/10500#issuecomment-1908713547

   
   ## CI report:
   
   * 93deb5002c4379e20f9aef4813d5ae3100513e11 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22146)
 
   
   



Re: [PR] [HUDI-7238] Bug fixes and optimization of ExternalSpillableMap [hudi]

2024-01-24 Thread via GitHub


hudi-bot commented on PR #10344:
URL: https://github.com/apache/hudi/pull/10344#issuecomment-1908713120

   
   ## CI report:
   
   * f0d32bea4e960cd85b8e344597ec4f006c213b44 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22145)
 
   
   



Re: [PR] [HUDI-7228] Fix eager closure of log reader input streams with log record reader [hudi]

2024-01-24 Thread via GitHub


hudi-bot commented on PR #10340:
URL: https://github.com/apache/hudi/pull/10340#issuecomment-1908688740

   
   ## CI report:
   
   * 8d999c7e7946d2dc3d05e8bd7ebf53d5d5e8a57a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21558)
 
   * f401ab103abf2eb6e2827a98fa8627795642f064 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22147)
 
   
   



Re: [PR] [HUDI-7298] Write bad records to error table in more cases instead of failing stream [hudi]

2024-01-24 Thread via GitHub


hudi-bot commented on PR #10500:
URL: https://github.com/apache/hudi/pull/10500#issuecomment-1908621790

   
   ## CI report:
   
   * edf05d8127d2281fb7ad62747f494f0f3a2e9b2c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22128)
 
   * 93deb5002c4379e20f9aef4813d5ae3100513e11 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22146)
 
   
   



Re: [PR] [HUDI-7228] Fix eager closure of log reader input streams with log record reader [hudi]

2024-01-24 Thread via GitHub


hudi-bot commented on PR #10340:
URL: https://github.com/apache/hudi/pull/10340#issuecomment-1908621294

   
   ## CI report:
   
   * 8d999c7e7946d2dc3d05e8bd7ebf53d5d5e8a57a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21558)
 
   * f401ab103abf2eb6e2827a98fa8627795642f064 UNKNOWN
   
   



Re: [PR] [HUDI-7298] Write bad records to error table in more cases instead of failing stream [hudi]

2024-01-24 Thread via GitHub


hudi-bot commented on PR #10500:
URL: https://github.com/apache/hudi/pull/10500#issuecomment-1908608813

   
   ## CI report:
   
   * edf05d8127d2281fb7ad62747f494f0f3a2e9b2c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22128)
 
   * 93deb5002c4379e20f9aef4813d5ae3100513e11 UNKNOWN
   
   



Re: [PR] [HUDI-7228] Fix eager closure of log reader input streams with log record reader [hudi]

2024-01-24 Thread via GitHub


bvaradar commented on PR #10340:
URL: https://github.com/apache/hudi/pull/10340#issuecomment-1908605720

   Fixed Conflicts and updated the diff





Re: [PR] [MINOR] Fix UT error in HUDI-6941 with stage task numbers [hudi]

2024-01-24 Thread via GitHub


bvaradar commented on code in PR #10554:
URL: https://github.com/apache/hudi/pull/10554#discussion_r1465256370


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestInsertTable.scala:
##
@@ -2160,6 +2174,8 @@ class TestInsertTable extends HoodieSparkSqlTestBase {
|union
|select '1' as id, 'aa' as name, 123 as dt, '2023-10-12' as `day`, 12 as `hour`
|""".stripMargin)
+  val stageClassName = classOf[HoodieSparkEngineContext].getSimpleName
+  spark.sparkContext.addSparkListener(new StageParallelismListener(stageName = stageClassName))

Review Comment:
   @xuzifu666 Can you have StageParallelismListener update a shared (static) 
counter and assert here that the count increased by at least one, to ensure 
StageParallelismListener was indeed called as expected?
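
   A sketch of this suggestion in plain Java, without Spark on the classpath. 
`CountingStageListener`, `onStageCompleted(String)`, and the stage names are 
hypothetical stand-ins for the test's `StageParallelismListener` and Spark's 
listener callback; the real test would register the listener on the Spark 
context instead of calling it directly.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for StageParallelismListener; not the Spark listener API.
final class CountingStageListener {
  // Shared (static) counter so the test can observe calls made on any instance.
  static final AtomicInteger MATCHED_STAGES = new AtomicInteger();

  private final String stageName;

  CountingStageListener(String stageName) {
    this.stageName = stageName;
  }

  // Called once per completed stage; counts only stages whose name matches.
  void onStageCompleted(String completedStageName) {
    if (completedStageName.contains(stageName)) {
      MATCHED_STAGES.incrementAndGet();
    }
  }

  // Simulates a job with two stages (made-up names) and returns how much the counter grew.
  static int demo() {
    int before = MATCHED_STAGES.get();
    CountingStageListener listener = new CountingStageListener("HoodieSparkEngineContext");
    listener.onStageCompleted("collect at HoodieSparkEngineContext.java");
    listener.onStageCompleted("count at SomethingElse.scala");
    return MATCHED_STAGES.get() - before;
  }

  public static void main(String[] args) {
    int delta = demo();
    if (delta < 1) {
      throw new AssertionError("listener was never invoked for the expected stage");
    }
    System.out.println("matched stages: " + delta);
  }
}
```

   Comparing the counter before and after, rather than asserting an absolute 
value, keeps the check valid when other tests in the same JVM also touch the 
static counter.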






Re: [PR] [HUDI-7238] Bug fixes and optimization of ExternalSpillableMap [hudi]

2024-01-24 Thread via GitHub


hudi-bot commented on PR #10344:
URL: https://github.com/apache/hudi/pull/10344#issuecomment-1908511796

   
   ## CI report:
   
   * d5c669fdb2b061ff6e65b42aa969be2902c033c7 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21730)
 
   * f0d32bea4e960cd85b8e344597ec4f006c213b44 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22145)
 
   
   



Re: [PR] [HUDI-7238] Bug fixes and optimization of ExternalSpillableMap [hudi]

2024-01-24 Thread via GitHub


hudi-bot commented on PR #10344:
URL: https://github.com/apache/hudi/pull/10344#issuecomment-1908495665

   
   ## CI report:
   
   * d5c669fdb2b061ff6e65b42aa969be2902c033c7 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21730)
 
   * f0d32bea4e960cd85b8e344597ec4f006c213b44 UNKNOWN
   
   



[jira] [Closed] (HUDI-7330) With 0.14 upgrade, MIT failing with mismatched case in field names.

2024-01-24 Thread Aditya Goenka (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aditya Goenka closed HUDI-7330.
---
Fix Version/s: (was: 1.1.0)
   Resolution: Duplicate

Duplicate of already known bug - https://issues.apache.org/jira/browse/HUDI-6472

 

So cancelling this

> With 0.14 upgrade, MIT failing with mismatched case in field names.
> ---
>
> Key: HUDI-7330
> URL: https://issues.apache.org/jira/browse/HUDI-7330
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: Aditya Goenka
>Priority: Critical
>
> With 0.14.0 upgrade, MIT is failing when the case of the fields does not match.
>  
> Reproducible Code - 
> create table merge_source (
> id int, name string, price double
> ) using hudi
> tblproperties
> (primaryKey = 'id');
>
> insert into merge_source values (1, "old_a1", 22.22), (2, "new_a2", 33.33), (3, "new_a3", 44.44);
>
> create table hudi_table (
>   id INT,
>   name STRING,
>   price DOUBLE
> ) USING hudi
>  tblproperties
> (primaryKey = 'id');
>
> insert into hudi_table values (1, "oldid1", 100.00), (2, "oldid2", 200.00);
>
> merge into hudi_table as target
> using merge_source as source
> on target.id = source.id
> when matched then update set ID=source.ID, name=source.name



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-6464) Implement Spark SQL Merge Into for tables without primary key

2024-01-24 Thread Jonathan Vexler (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Vexler closed HUDI-6464.
-
Resolution: Fixed

> Implement Spark SQL Merge Into for tables without primary key
> -
>
> Key: HUDI-6464
> URL: https://issues.apache.org/jira/browse/HUDI-6464
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: spark-sql
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
>
> Merge Into currently only matches on the primary key which pkless tables 
> don't have





[jira] [Commented] (HUDI-7330) With 0.14 upgrade, MIT failing with mismatched case in field names.

2024-01-24 Thread Aditya Goenka (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17810480#comment-17810480
 ] 

Aditya Goenka commented on HUDI-7330:
-

Github issue - [https://github.com/apache/hudi/issues/10558]

> With 0.14 upgrade, MIT failing with mismatched case in field names.
> ---
>
> Key: HUDI-7330
> URL: https://issues.apache.org/jira/browse/HUDI-7330
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: Aditya Goenka
>Priority: Critical
> Fix For: 1.1.0
>
>
> With 0.14.0 upgrade, MIT is failing when the case of the fields does not match.
>  
> Reproducible Code - 
> create table merge_source (
> id int, name string, price double
> ) using hudi
> tblproperties
> (primaryKey = 'id');
>
> insert into merge_source values (1, "old_a1", 22.22), (2, "new_a2", 33.33), (3, "new_a3", 44.44);
>
> create table hudi_table (
>   id INT,
>   name STRING,
>   price DOUBLE
> ) USING hudi
>  tblproperties
> (primaryKey = 'id');
>
> insert into hudi_table values (1, "oldid1", 100.00), (2, "oldid2", 200.00);
>
> merge into hudi_table as target
> using merge_source as source
> on target.id = source.id
> when matched then update set ID=source.ID, name=source.name





Re: [I] Hudi behaviour if AWS Glue concurrency is triggered[SUPPORT] [hudi]

2024-01-24 Thread via GitHub


ad1happy2go commented on issue #10559:
URL: https://github.com/apache/hudi/issues/10559#issuecomment-1908435279

   @rishabhreply Sorry, but I am a bit confused. Do you really want to use 
insert_overwrite in this case? If you submit two parallel jobs with 
insert_overwrite, one will overwrite the other's data in any case. Even if 
you run them sequentially, you will still lose the data ingested by the first 
one. So you can only use insert_overwrite if you want to process all 10 files 
in one batch.
   
   Let me know if I am not thinking in the right direction.
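
   The reasoning above can be illustrated with a toy model in plain Java. This 
is not Hudi code: `ToyTable`, `insertOverwrite`, and the file names are 
hypothetical, and the model assumes every batch targets the same partition. 
Each `insertOverwrite` call replaces whatever the partition held, so only a 
single batch containing all files keeps all the data.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of overwrite semantics; hypothetical, not the Hudi writer.
final class ToyTable {
  private final Map<String, List<String>> partitions = new HashMap<>();

  // Replaces the partition's contents with the incoming batch.
  void insertOverwrite(String partition, List<String> batch) {
    partitions.put(partition, new ArrayList<>(batch));
  }

  List<String> read(String partition) {
    return partitions.getOrDefault(partition, List.of());
  }

  // Two jobs run one after another against the same partition.
  static List<String> sequentialJobs() {
    ToyTable table = new ToyTable();
    table.insertOverwrite("p1", List.of("file1", "file2"));
    table.insertOverwrite("p1", List.of("file3", "file4"));
    return table.read("p1");
  }

  // One job processes every file in a single batch.
  static List<String> singleBatch() {
    ToyTable table = new ToyTable();
    table.insertOverwrite("p1", List.of("file1", "file2", "file3", "file4"));
    return table.read("p1");
  }

  public static void main(String[] args) {
    System.out.println("sequential: " + sequentialJobs());
    System.out.println("single batch: " + singleBatch());
  }
}
```

   In the sequential case only the second job's files remain, which is the 
data loss described above; with parallel jobs the survivor simply depends on 
which job finishes last.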
   





[jira] [Created] (HUDI-7330) With 0.14 upgrade, MIT failing with mismatched case in field names.

2024-01-24 Thread Aditya Goenka (Jira)
Aditya Goenka created HUDI-7330:
---

 Summary: With 0.14 upgrade, MIT failing with mismatched case in 
field names.
 Key: HUDI-7330
 URL: https://issues.apache.org/jira/browse/HUDI-7330
 Project: Apache Hudi
  Issue Type: Bug
  Components: spark-sql
Reporter: Aditya Goenka
 Fix For: 1.1.0


With 0.14.0 upgrade, MIT is failing when the case of the fields does not match.

 

Reproducible Code - 


create table merge_source (
id int, name string, price double
) using hudi
tblproperties
(primaryKey = 'id');

insert into merge_source values (1, "old_a1", 22.22), (2, "new_a2", 33.33), (3, "new_a3", 44.44);

create table hudi_table (
  id INT,
  name STRING,
  price DOUBLE
) USING hudi
tblproperties
(primaryKey = 'id');

insert into hudi_table values (1, "oldid1", 100.00), (2, "oldid2", 200.00);

merge into hudi_table as target
using merge_source as source
on target.id = source.id
when matched then update set ID=source.ID, name=source.name
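
Spark SQL resolves identifiers case-insensitively by default, so `ID` in the statement above would be expected to refer to the column `id`. The sketch below only illustrates the difference between case-sensitive and case-insensitive field resolution; `FieldResolutionSketch` and `resolve` are hypothetical names, and nothing here is taken from the Hudi or Spark code path that actually fails.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Illustration only: resolving a referenced field name against a table schema.
public class FieldResolutionSketch {

  // Returns the schema field matching `name`, comparing with or without case.
  public static Optional<String> resolve(List<String> schema, String name, boolean caseSensitive) {
    return schema.stream()
        .filter(field -> caseSensitive ? field.equals(name) : field.equalsIgnoreCase(name))
        .findFirst();
  }

  public static void main(String[] args) {
    List<String> schema = Arrays.asList("id", "name", "price");

    // Exact comparison: "ID" does not match "id", so the field is not found.
    System.out.println(resolve(schema, "ID", true).isPresent()); // prints false

    // Case-insensitive comparison: "ID" resolves to the schema field "id".
    System.out.println(resolve(schema, "ID", false).get()); // prints id
  }
}
```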





Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]

2024-01-24 Thread via GitHub


bk-mz commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1908210748

   >What do you think about,
   
   TBH a bit of mixed emotions here.
   
   With 0.14 there is practically no way of understanding how indexing or 
statistical means are affecting queries apart from "output number of rows" in 
Spark SQL dataframe, i.e. are they used at all and if they are, how effectively?
   
   This issue could be closed; from our end we'll move forward with the assumption 
that indexing and statistical means in Hudi are ineffective, though we'll enable 
them on our critical fields in case future releases of Hudi implement 
performance improvements.





Re: [PR] [MINOR] Fix UT error in HUDI-6941 with stage task numbers [hudi]

2024-01-24 Thread via GitHub


hudi-bot commented on PR #10554:
URL: https://github.com/apache/hudi/pull/10554#issuecomment-1908112403

   
   ## CI report:
   
   * e6934024c687f7deb7942e0edb833818aa96b843 UNKNOWN
   * c3c58fa1feb8bf451e9d0d6cf7e074fe08010dbe Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22144)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6230] Handle aws glue partition index [hudi]

2024-01-24 Thread via GitHub


parisni commented on code in PR #8743:
URL: https://github.com/apache/hudi/pull/8743#discussion_r1464900587


##
hudi-aws/src/main/java/org/apache/hudi/aws/sync/AWSGlueCatalogSyncClient.java:
##
@@ -432,6 +443,120 @@ public void createTable(String tableName,
 }
   }
 
+  /**
+   * This will manage partitions indexes. Users can activate/deactivate them 
on existing tables.
+   * Removing index definition, will result in dropping the index.
+   * 
+   * reference doc for partition indexes:
+   * 
https://docs.aws.amazon.com/glue/latest/dg/partition-indexes.html#partition-index-getpartitions
+   *
+   * @param tableName
+   */
+  public void managePartitionIndexes(String tableName) throws 
ExecutionException, InterruptedException {
+if (!config.getBooleanOrDefault(META_SYNC_PARTITION_INDEX_FIELDS_ENABLE)) {
+  // deactivate indexing if enabled
+  if (getPartitionIndexEnable(tableName)) {
+LOG.warn("Deactivating partition indexing");

Review Comment:
   yes. The suggestion to use moto to mock aws glue is great. However it does 
not support partition index right now. So moto should be considered as a basis 
for IT in the hudi-aws, but not in this PR.
   
   BTW I tested this and provided a python script for people to try this out






Re: [PR] [MINOR] Fix UT error in HUDI-6941 with stage task numbers [hudi]

2024-01-24 Thread via GitHub


hudi-bot commented on PR #10554:
URL: https://github.com/apache/hudi/pull/10554#issuecomment-1908034214

   
   ## CI report:
   
   * e6934024c687f7deb7942e0edb833818aa96b843 UNKNOWN
   * 6af5f8ec3a1fb5459e4a0eb65f9ed152b4bbab2c Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22117)
 
   * c3c58fa1feb8bf451e9d0d6cf7e074fe08010dbe Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22144)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [MINOR] Fix UT error in HUDI-6941 with stage task numbers [hudi]

2024-01-24 Thread via GitHub


hudi-bot commented on PR #10554:
URL: https://github.com/apache/hudi/pull/10554#issuecomment-1908022512

   
   ## CI report:
   
   * e6934024c687f7deb7942e0edb833818aa96b843 UNKNOWN
   * 6af5f8ec3a1fb5459e4a0eb65f9ed152b4bbab2c Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22117)
 
   * c3c58fa1feb8bf451e9d0d6cf7e074fe08010dbe UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [MINOR] Fix UT error in HUDI-6941 with stage task numbers [hudi]

2024-01-24 Thread via GitHub


xuzifu666 commented on code in PR #10554:
URL: https://github.com/apache/hudi/pull/10554#discussion_r1464799184


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestInsertTable.scala:
##
@@ -2160,6 +2172,7 @@ class TestInsertTable extends HoodieSparkSqlTestBase {
|union
|select '1' as id, 'aa' as name, 123 as dt, '2023-10-12' as `day`, 
12 as `hour`
|""".stripMargin)
+  spark.sparkContext.addSparkListener(new 
StageParallelismListener(stageName = "collect at 
HoodieSparkEngineContext.java"))

Review Comment:
   Hi, I tried depending on the class. In this case all query stages relate 
to the HoodieSparkEngineContext class, so I changed the check to use this class. 
@bvaradar PTAL






Re: [I] [SUPPORT] Flink write to COW Hudi table,hive aggregate query results has duplicate data but select * did not [hudi]

2024-01-24 Thread via GitHub


CamelliaYjli commented on issue #10486:
URL: https://github.com/apache/hudi/issues/10486#issuecomment-1907801206

   > Yeah, you should use `HoodieHiveInputFormat` or 
`HoodieCombineHiveInputFormat`. This is a Chinese doc that you can use as a 
reference: https://www.yuque.com/yuzhao-my9fz/kb/kgv2rb
   
   OK,thx ~





[I] Hudi behaviour if AWS Glue concurrency is triggered[SUPPORT] [hudi]

2024-01-24 Thread via GitHub


rishabhreply opened a new issue, #10559:
URL: https://github.com/apache/hudi/issues/10559

   **Describe the problem you faced**
   
   It is not a problem but rather a question that I could not find in FAQs. 
Please let me know if it is unacceptable to ask here.
   
   I have data coming in multiple files (let's say 10 files) for one table, and 
all will have the same value in partition_column. My setup is a state machine with 
Glue parallelization enabled. Let's say I have set a batch size=2 and 
concurrency=5 in the state machine; this means the state machine will trigger 5 
parallel Glue job instances and give each instance 2 files to process. I am 
using the Hudi **insert_overwrite** operation.
   
   Q1. In this setting how will Hudi work as not all glue job instances might 
finish at the same time? Will I see any Hudi errors? Or will it "overwrite" the 
data written by the glue job instances that finished earlier?
   
   
   **Environment Description**
   
   * Hudi version : 
   
   * Spark version :
   
   * Hive version :
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) :
   
   
   
   





Re: [I] [SUPPORT] Flink write to COW Hudi table,hive aggregate query results has duplicate data but select * did not [hudi]

2024-01-24 Thread via GitHub


danny0405 closed issue #10486: [SUPPORT] Flink write to COW Hudi table,hive 
aggregate query results has duplicate data but select * did not
URL: https://github.com/apache/hudi/issues/10486





Re: [I] [SUPPORT] Flink write to COW Hudi table,hive aggregate query results has duplicate data but select * did not [hudi]

2024-01-24 Thread via GitHub


danny0405 commented on issue #10486:
URL: https://github.com/apache/hudi/issues/10486#issuecomment-1907724040

   Yeah, you should use `HoodieHiveInputFormat` or 
`HoodieCombineHiveInputFormat`. This is a Chinese doc that you can use as a 
reference: https://www.yuque.com/yuzhao-my9fz/kb/kgv2rb





(hudi) branch asf-site updated: [Docs] updated button size so join now is on one line (#10557)

2024-01-24 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 539402c0638 [Docs] updated button size so join now is on one line 
(#10557)
539402c0638 is described below

commit 539402c06387111b3e3ce8c243120e419de27d8e
Author: nadine farah 
AuthorDate: Wed Jan 24 01:16:51 2024 -0800

[Docs] updated button size so join now is on one line (#10557)
---
 website/src/components/EventFeature/styles.module.css | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/website/src/components/EventFeature/styles.module.css 
b/website/src/components/EventFeature/styles.module.css
index 416f746d6e5..ff6deb0db25 100644
--- a/website/src/components/EventFeature/styles.module.css
+++ b/website/src/components/EventFeature/styles.module.css
@@ -28,5 +28,5 @@
   font-weight: bold; 
   display: inline-block; 
   text-align: center;
-  min-width: 230px
+  min-width: 280px
 }
\ No newline at end of file



Re: [PR] updated button size so join now is on one line [hudi]

2024-01-24 Thread via GitHub


danny0405 merged PR #10557:
URL: https://github.com/apache/hudi/pull/10557





Re: [I] [SUPPORT] is it possible to read/write hudi files with another programming language? [hudi]

2024-01-24 Thread via GitHub


schlichtanders commented on issue #7446:
URL: https://github.com/apache/hudi/issues/7446#issuecomment-1907707785

   Thank you @cheunhong. I agree and it is a pity. Hudi's support for streaming 
is super attractive for me. Neither delta-rs nor Iceberg has it, as far as I 
know...





[jira] [Updated] (HUDI-7311) Comparing date with date literal in string format causes class cast exception during filter push down

2024-01-24 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-7311:
-
Fix Version/s: 1.0.0

> Comparing date with date literal in string format causes class cast exception 
> during filter push down
> -
>
> Key: HUDI-7311
> URL: https://issues.apache.org/jira/browse/HUDI-7311
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink
>Affects Versions: 0.14.0, 0.14.1
>Reporter: Yao Zhang
>Assignee: Yao Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Given any table with a field of type date (e.g. field d_date with type 
> date), execute a SQL query with a condition on this field in the where clause.
> {code:sql}
> select d_date from xxx where d_date = '2020-01-01'
> {code}
> An exception will occur:
> {code:java}
> Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer
> at org.apache.hudi.source.ExpressionPredicates.toParquetPredicate(ExpressionPredicates.java:613)
> at org.apache.hudi.source.ExpressionPredicates.access$100(ExpressionPredicates.java:64)
> at org.apache.hudi.source.ExpressionPredicates$ColumnPredicate.filter(ExpressionPredicates.java:226)
> at org.apache.hudi.table.format.RecordIterators.getParquetRecordIterator(RecordIterators.java:68)
> at org.apache.hudi.table.format.cow.CopyOnWriteInputFormat.open(CopyOnWriteInputFormat.java:130)
> at org.apache.hudi.table.format.cow.CopyOnWriteInputFormat.open(CopyOnWriteInputFormat.java:66)
> at org.apache.flink.streaming.api.functions.source.InputFormatSourceFunction.run(InputFormatSourceFunction.java:84)
> at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:110)
> at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:67)
> at org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:333)
> {code}
> Hudi Flink cannot convert the date literal in String format to Integer (the 
> primitive type of date). However, the same SQL works well in Flink without Hudi.
> In summary, we should add automatic literal type conversion before filter push 
> down.
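
The conversion the description asks for can be sketched in plain Java. This is a minimal illustration, assuming the date column is stored as an Integer count of days since 1970-01-01 as the description states; `DateLiteralSketch` and `toEpochDays` are hypothetical names and this is not the `ImplicitTypeConverter` added by the fix.

```java
import java.time.LocalDate;

// Illustration only: normalizing a DATE literal before it is used in a pushed-down filter.
public class DateLiteralSketch {

  // Hypothetical helper: returns the literal as days since 1970-01-01.
  public static int toEpochDays(Object literal) {
    if (literal instanceof Integer) {
      return (Integer) literal; // already in the primitive representation
    }
    if (literal instanceof LocalDate) {
      return Math.toIntExact(((LocalDate) literal).toEpochDay());
    }
    // Fall back to parsing an ISO-8601 date string such as "2020-01-01".
    return Math.toIntExact(LocalDate.parse(String.valueOf(literal)).toEpochDay());
  }

  public static void main(String[] args) {
    Object literal = "2020-01-01";

    // Casting the String literal directly reproduces the reported failure mode.
    try {
      Integer broken = (Integer) literal;
      System.out.println(broken);
    } catch (ClassCastException e) {
      System.out.println("direct cast fails with ClassCastException");
    }

    // Converting first yields a value comparable with the stored integers.
    System.out.println(toEpochDays(literal)); // prints 18262
  }
}
```

A malformed string makes `LocalDate.parse` throw `DateTimeParseException`; how a real converter should react to that (fail the query or skip the push down) is a design decision this sketch leaves open.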





(hudi) branch master updated: [HUDI-7311] Add implicit literal type conversion before filter push down (#10531)

2024-01-24 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 77833cdb096 [HUDI-7311] Add implicit literal type conversion before 
filter push down (#10531)
77833cdb096 is described below

commit 77833cdb09661b2cdac740520b51a29264afd9c7
Author: Paul Zhang 
AuthorDate: Wed Jan 24 17:15:07 2024 +0800

[HUDI-7311] Add implicit literal type conversion before filter push down 
(#10531)
---
 .../apache/hudi/source/ExpressionPredicates.java   |   4 +-
 .../apache/hudi/util/ImplicitTypeConverter.java| 134 +
 .../hudi/source/TestExpressionPredicates.java  |  61 ++
 3 files changed, 198 insertions(+), 1 deletion(-)

diff --git 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/ExpressionPredicates.java
 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/ExpressionPredicates.java
index 8faf705a81f..58ee59a8176 100644
--- 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/ExpressionPredicates.java
+++ 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/ExpressionPredicates.java
@@ -26,6 +26,7 @@ import 
org.apache.flink.table.expressions.ValueLiteralExpression;
 import org.apache.flink.table.functions.BuiltInFunctionDefinitions;
 import org.apache.flink.table.functions.FunctionDefinition;
 import org.apache.flink.table.types.logical.LogicalType;
+import org.apache.hudi.util.ImplicitTypeConverter;
 import org.apache.parquet.filter2.predicate.FilterPredicate;
 import org.apache.parquet.filter2.predicate.Operators;
 import org.slf4j.Logger;
@@ -223,7 +224,8 @@ public class ExpressionPredicates {
 
 @Override
 public FilterPredicate filter() {
-  return toParquetPredicate(getFunctionDefinition(), literalType, 
columnName, literal);
+  Serializable convertedLiteral = 
ImplicitTypeConverter.convertImplicitly(literalType, literal);
+  return toParquetPredicate(getFunctionDefinition(), literalType, 
columnName, convertedLiteral);
 }
 
 /**
diff --git 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/ImplicitTypeConverter.java
 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/ImplicitTypeConverter.java
new file mode 100644
index 000..601b878655f
--- /dev/null
+++ 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/ImplicitTypeConverter.java
@@ -0,0 +1,134 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.util;
+
+import org.apache.flink.table.types.logical.LogicalType;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.Serializable;
+import java.time.LocalDate;
+import java.time.LocalDateTime;
+import java.time.LocalTime;
+import java.time.ZoneOffset;
+import java.time.temporal.ChronoField;
+
+/**
+ * Implicit type converter for predicates push down.
+ */
+public class ImplicitTypeConverter {
+
+  private static final Logger LOG = 
LoggerFactory.getLogger(ImplicitTypeConverter.class);
+
+  /**
+   * Convert the literal to the corresponding type.
+   * @param literalType The type of the literal.
+   * @param literal The literal value.
+   * @return The converted literal.
+   */
+  public static Serializable convertImplicitly(LogicalType literalType, 
Serializable literal) {
+try {
+  switch (literalType.getTypeRoot()) {
+case BOOLEAN:
+  if (literal instanceof Boolean) {
+return literal;
+  } else {
+return Boolean.valueOf(String.valueOf(literal));
+  }
+case TINYINT:
+case SMALLINT:
+case INTEGER:
+  if (literal instanceof Integer) {
+return literal;
+  } else {
+return Integer.valueOf(String.valueOf(literal));
+  }
+case BIGINT:
+  if (literal instanceof Long) {
+return literal;
+  } else if (literal instanceof Integer) {
+return new Long((Integer) literal);
+  } else {
+return Long.valueOf(String.valueOf(lite

[jira] [Closed] (HUDI-7311) Comparing date with date literal in string format causes class cast exception during filter push down

2024-01-24 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-7311.

Resolution: Fixed

Fixed via master branch: 77833cdb09661b2cdac740520b51a29264afd9c7






Re: [PR] [HUDI-7311] Add implicit literal type conversion before filter push down [hudi]

2024-01-24 Thread via GitHub


danny0405 merged PR #10531:
URL: https://github.com/apache/hudi/pull/10531

