Re: [I] [SUPPORT] Cloudwatch metrics not published in moving from 0.12.1 to 0.14 [hudi]

2024-05-13 Thread via GitHub


ad1happy2go commented on issue #11205:
URL: https://github.com/apache/hudi/issues/11205#issuecomment-2109345823

   @ajain-cohere Can you post the complete stack trace?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #11210:
URL: https://github.com/apache/hudi/pull/11210#issuecomment-2109343812

   
   ## CI report:
   
   * 1e7ab5d044f35d65670bb0fc442721e01a677d8d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23896)
 
   * 4442f34765c904d3995fd5047c2e8a6197525c5b Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23897)
 
   * 6ab5ee102486ffe40778da29cd7eb2733e3f6b0e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23899)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [I] [SUPPORT] Error executing Merge Or Read [hudi]

2024-05-13 Thread via GitHub


jai20242 commented on issue #11199:
URL: https://github.com/apache/hudi/issues/11199#issuecomment-2109341298

   And why does it only happen with Merge On Read? Also, I have tested version 
1.0.0-beta and the issue does not occur there (it works well, but we can't use a 
beta version in production).





Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-13 Thread via GitHub


jonvex commented on code in PR #11210:
URL: https://github.com/apache/hudi/pull/11210#discussion_r1599415963


##
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieStorageConfig.java:
##
@@ -87,6 +87,8 @@ public class HoodieStorageConfig extends HoodieConfig {
   .withDocumentation("Lower values increase the size in bytes of metadata tracked within HFile, but can offer potentially "
   + "faster lookup times.");
 
+
+

Review Comment:
   remove extra lines






Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #11210:
URL: https://github.com/apache/hudi/pull/11210#issuecomment-2109334800

   
   ## CI report:
   
   * 1e7ab5d044f35d65670bb0fc442721e01a677d8d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23896)
 
   * 4442f34765c904d3995fd5047c2e8a6197525c5b Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23897)
 
   * 6ab5ee102486ffe40778da29cd7eb2733e3f6b0e UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7617] Fix issues for bulk insert user defined partitioner in StreamSync [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #11014:
URL: https://github.com/apache/hudi/pull/11014#issuecomment-2109334306

   
   ## CI report:
   
   * 1ee0f29f2bd4b02aeb3370d864cbdae946be809e Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23895)
 
   * 33710549e6c4071bd327ef528e17302e42bf829c Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23898)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7549] Reverting spurious log block deduction with LogRecordReader [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #10922:
URL: https://github.com/apache/hudi/pull/10922#issuecomment-2109334107

   
   ## CI report:
   
   * 802eeb74510ddbeb5cd952d9192aaf623d9c7ee9 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23894)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [I] [SUPPORT] In hudi 0.14.0, the hoodie.properties file is modified with each micro batch. [hudi]

2024-05-13 Thread via GitHub


CaesarWangX closed issue #11200: [SUPPORT] In hudi 0.14.0, the 
hoodie.properties file is modified with each micro batch.
URL: https://github.com/apache/hudi/issues/11200





Re: [I] [SUPPORT]hoodie.datasource.read.file.index.listing.mode is always eager [hudi]

2024-05-13 Thread via GitHub


CaesarWangX closed issue #11201: 
[SUPPORT]hoodie.datasource.read.file.index.listing.mode is always eager
URL: https://github.com/apache/hudi/issues/11201





Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #11210:
URL: https://github.com/apache/hudi/pull/11210#issuecomment-2109290453

   
   ## CI report:
   
   * 1e7ab5d044f35d65670bb0fc442721e01a677d8d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23896)
 
   * 4442f34765c904d3995fd5047c2e8a6197525c5b Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23897)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7617] Fix issues for bulk insert user defined partitioner in StreamSync [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #11014:
URL: https://github.com/apache/hudi/pull/11014#issuecomment-2109289845

   
   ## CI report:
   
   * ca6231f4648a3cfe9e1a14aa76987d2f26a69919 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23237)
 
   * 1ee0f29f2bd4b02aeb3370d864cbdae946be809e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23895)
 
   * 33710549e6c4071bd327ef528e17302e42bf829c UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7549] Reverting spurious log block deduction with LogRecordReader [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #10922:
URL: https://github.com/apache/hudi/pull/10922#issuecomment-2109289418

   
   ## CI report:
   
   * 802eeb74510ddbeb5cd952d9192aaf623d9c7ee9 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [I] [SUPPORT]hudi way of doing bucket index cannot be used to improve query engines queries such join and filter [hudi]

2024-05-13 Thread via GitHub


ziudu commented on issue #11204:
URL: https://github.com/apache/hudi/issues/11204#issuecomment-2109284886

   I'm a newbie. It took me a while to understand why bucket join does not work.





Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #11210:
URL: https://github.com/apache/hudi/pull/11210#issuecomment-2109282123

   
   ## CI report:
   
   * 1e7ab5d044f35d65670bb0fc442721e01a677d8d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23896)
 
   * 4442f34765c904d3995fd5047c2e8a6197525c5b UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[I] DELETE Statement Deleting Another Record [hudi]

2024-05-13 Thread via GitHub


Amar1404 opened a new issue, #11212:
URL: https://github.com/apache/hudi/issues/11212

   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
   
   - Join the mailing list to engage in conversations and get faster support at 
dev-subscr...@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an 
[issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   I have duplicate keys in a Hudi table because of insert statements. When I 
tried deleting one of the rows for a key using an additional filter, both rows 
with that key were deleted.
   
   A clear and concise description of the problem.
   
   **To Reproduce**
   
   
   Steps to reproduce the behavior:
   
   1. Create a non-partitioned table and insert two records with the same key.
   2. Try to delete only one of the rows by filtering on the key and 
`_hoodie_commit_seqno`.
   3. Check the table: both records with that key have been deleted.
   
   **Expected behavior**
   
   The delete command should only delete the single row that matched the 
filter.
   
   **Environment Description**
   
   * Hudi version : 0.12.3
   
   * Spark version : 3.3
   
   * Hive version : 3
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) : s3
   
   * Running on Docker? (yes/no) : no 
   
   
   **Additional context**
   
   Add any other context about the problem here.
   
   **Stacktrace**
   
   ```Add the stacktrace of the error.```
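A note on the reported behavior: it is consistent with Hudi keying deletes by record key alone, so a filter on `_hoodie_commit_seqno` may select a single row while the delete that actually gets applied carries only the key and removes every row sharing it. A toy model of that semantics (illustrative Java only, not Hudi APIs):

```java
import java.util.ArrayList;
import java.util.List;

public class DeleteByKeyModel {
    // Toy model: each row is {recordKey, seqno}. The delete is keyed by the
    // record key only, so the seqno used to *select* a row does not narrow
    // what gets removed. All names here are illustrative, not Hudi APIs.
    static List<String[]> applyDelete(List<String[]> rows, String keyToDelete) {
        List<String[]> remaining = new ArrayList<>();
        for (String[] row : rows) {
            if (!row[0].equals(keyToDelete)) { // match on key only; seqno ignored
                remaining.add(row);
            }
        }
        return remaining;
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"k1", "seq-001"}); // duplicate key from inserts
        rows.add(new String[] {"k1", "seq-002"});
        rows.add(new String[] {"k2", "seq-003"});
        // The user filtered on k1 + seq-002, but the delete carries only "k1":
        List<String[]> after = applyDelete(rows, "k1");
        System.out.println(after.size()); // prints 1 -> both k1 rows are gone
    }
}
```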
   
   





Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #11210:
URL: https://github.com/apache/hudi/pull/11210#issuecomment-2109274982

   
   ## CI report:
   
   * 1e7ab5d044f35d65670bb0fc442721e01a677d8d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23896)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7549] Reverting spurious log block deduction with LogRecordReader [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #10922:
URL: https://github.com/apache/hudi/pull/10922#issuecomment-2109274435

   
   ## CI report:
   
   * 802eeb74510ddbeb5cd952d9192aaf623d9c7ee9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23894)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-13 Thread via GitHub


yihua commented on code in PR #11210:
URL: https://github.com/apache/hudi/pull/11210#discussion_r1599368691


##
hudi-common/src/test/java/org/apache/hudi/common/testutils/reader/HoodieFileSliceTestUtils.java:
##
@@ -207,7 +208,7 @@ private static HoodieDataBlock createDataBlock(
 false,
 header,
 HoodieRecord.RECORD_KEY_METADATA_FIELD,
-CompressionCodecName.GZIP,
+"gzip",

Review Comment:
   Replaced such occurrences with the default config value.






Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-13 Thread via GitHub


yihua commented on code in PR #11210:
URL: https://github.com/apache/hudi/pull/11210#discussion_r1599366211


##
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieHFileDataBlock.java:
##
@@ -74,11 +61,10 @@
  * base file format.
  */
 public class HoodieHFileDataBlock extends HoodieDataBlock {
+  public static final String HFILE_COMPRESSION_ALGO_PARAM_KEY = 
"hfile_compression_algo";

Review Comment:
   Fixed by using `HFILE_COMPRESSION_ALGORITHM_NAME.key()` directly.  Also, I 
directly pass the String value of the config down so the String value is 
directly converted to the corresponding `Compression.Algorithm`, like 
`ParquetUtils`.
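
The approach described here — carrying the codec as a plain String in the config and resolving it to the enum only where the writer needs it — can be sketched as follows (the enum is a stand-in for illustration, not HBase's real `Compression.Algorithm`):

```java
public class CompressionNameSketch {
    // Stand-in enum; the real type in the PR is HBase's Compression.Algorithm.
    enum Algorithm { NONE, GZ, SNAPPY, LZ4 }

    // The config value travels as a plain String (e.g. "gz") and is converted
    // to the enum only at the point of use, mirroring how ParquetUtils handles
    // the parquet codec name as a String.
    static Algorithm fromConfigValue(String name) {
        return Algorithm.valueOf(name.toUpperCase());
    }

    public static void main(String[] args) {
        System.out.println(fromConfigValue("gz")); // prints GZ
    }
}
```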






Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-13 Thread via GitHub


yihua commented on code in PR #11210:
URL: https://github.com/apache/hudi/pull/11210#discussion_r1599351065


##
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieParquetDataBlock.java:
##
@@ -99,29 +90,17 @@ public HoodieLogBlockType getBlockType() {
 
   @Override
   protected byte[] serializeRecords(List<HoodieRecord> records, StorageConfiguration<?> storageConf) throws IOException {
-if (records.size() == 0) {
-  return new byte[0];
-}
-
-Schema writerSchema = new 
Schema.Parser().parse(super.getLogBlockHeader().get(HeaderMetadataType.SCHEMA));
-ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
-HoodieConfig config = new HoodieConfig();
-config.setValue(PARQUET_COMPRESSION_CODEC_NAME.key(), 
compressionCodecName.get().name());
-config.setValue(PARQUET_BLOCK_SIZE.key(), 
String.valueOf(ParquetWriter.DEFAULT_BLOCK_SIZE));
-config.setValue(PARQUET_PAGE_SIZE.key(), 
String.valueOf(ParquetWriter.DEFAULT_PAGE_SIZE));
-config.setValue(PARQUET_MAX_FILE_SIZE.key(), String.valueOf(1024 * 1024 * 
1024));
-config.setValue(PARQUET_COMPRESSION_RATIO_FRACTION.key(), 
String.valueOf(expectedCompressionRatio.get()));
-config.setValue(PARQUET_DICTIONARY_ENABLED, 
String.valueOf(useDictionaryEncoding.get()));
-HoodieRecordType recordType = records.iterator().next().getRecordType();
-try (HoodieFileWriter parquetWriter = 
HoodieFileWriterFactory.getFileWriter(
-HoodieFileFormat.PARQUET, outputStream, storageConf, config, 
writerSchema, recordType)) {
-  for (HoodieRecord record : records) {
-String recordKey = getRecordKey(record).orElse(null);
-parquetWriter.write(recordKey, record, writerSchema);
-  }
-  outputStream.flush();
-}
-return outputStream.toByteArray();
+Map<String, String> paramsMap = new HashMap<>();
+paramsMap.put(PARQUET_COMPRESSION_CODEC_NAME.key(), 
compressionCodecName.get());
+paramsMap.put(PARQUET_COMPRESSION_RATIO_FRACTION.key(), 
String.valueOf(expectedCompressionRatio.get()));
+paramsMap.put(PARQUET_DICTIONARY_ENABLED.key(), 
String.valueOf(useDictionaryEncoding.get()));
+
+return FileFormatUtils.getInstance(PARQUET).serializeRecordsToLogBlock(
+storageConf, records,
+new 
Schema.Parser().parse(super.getLogBlockHeader().get(HoodieLogBlock.HeaderMetadataType.SCHEMA)),

Review Comment:
   Fixed.






Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-13 Thread via GitHub


yihua commented on code in PR #11210:
URL: https://github.com/apache/hudi/pull/11210#discussion_r1599348814


##
hudi-hadoop-common/src/main/java/org/apache/hudi/common/util/ParquetUtils.java:
##
@@ -366,6 +382,35 @@ public void writeMetaFile(HoodieStorage storage,
 }
   }
 
+  @Override
+  public byte[] serializeRecordsToLogBlock(StorageConfiguration<?> storageConf,
+   List<HoodieRecord> records,
+   Schema writerSchema,
+   Schema readerSchema,
+   String keyFieldName,
+   Map<String, String> paramsMap) throws IOException {
+if (records.size() == 0) {
+  return new byte[0];
+}
+
+ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
+HoodieConfig config = new HoodieConfig();
+paramsMap.entrySet().stream().forEach(entry -> 
config.setValue(entry.getKey(), entry.getValue()));
+config.setValue(PARQUET_BLOCK_SIZE.key(), 
String.valueOf(ParquetWriter.DEFAULT_BLOCK_SIZE));
+config.setValue(PARQUET_PAGE_SIZE.key(), 
String.valueOf(ParquetWriter.DEFAULT_PAGE_SIZE));
+config.setValue(PARQUET_MAX_FILE_SIZE.key(), String.valueOf(1024 * 1024 * 
1024));

Review Comment:
   This PR only moves the code.  I've created a follow-up to revisit these 
hardcoded config values, HUDI-7755.  My understanding is that for log blocks, 
current settings are good enough for log scanning.
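
The hardcoded values in question follow a copy-params-then-set-defaults pattern; a minimal sketch with a plain `Map` (the string keys and sizes here are illustrative stand-ins, not the real Hudi config keys):

```java
import java.util.HashMap;
import java.util.Map;

public class LogBlockConfigSketch {
    // Mirrors the shape of the quoted diff: caller-supplied params are copied
    // into the config first, then the block/page/max-file sizes are pinned to
    // fixed defaults. Keys are illustrative, not real Hudi config keys.
    static Map<String, String> buildWriterConfig(Map<String, String> paramsMap) {
        Map<String, String> config = new HashMap<>(paramsMap);
        config.put("parquet.block.size", String.valueOf(128 * 1024 * 1024));
        config.put("parquet.page.size", String.valueOf(1024 * 1024));
        config.put("parquet.max.file.size", String.valueOf(1024 * 1024 * 1024));
        return config;
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        params.put("codec", "gzip"); // caller-supplied param survives
        System.out.println(buildWriterConfig(params).get("parquet.page.size")); // prints 1048576
    }
}
```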






[jira] [Updated] (HUDI-7755) Revisit the configs in ParquetUtils.serializeRecordsToLogBlock

2024-05-13 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7755:

Description: For serializing log records to Parquet log blocks, there are 
hardcoded config values for writing the records in parquet format 
(serializeRecordsToLogBlock)

> Revisit the configs in ParquetUtils.serializeRecordsToLogBlock
> --
>
> Key: HUDI-7755
> URL: https://issues.apache.org/jira/browse/HUDI-7755
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Priority: Major
> Fix For: 1.1.0
>
>
> For serializing log records to Parquet log blocks, there are hardcoded config 
> values for writing the records in parquet format (serializeRecordsToLogBlock)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7755) Revisit the configs in ParquetUtils.serializeRecordsToLogBlock

2024-05-13 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7755:

Description: For serializing log records to Parquet log blocks, there are 
hardcoded config values for writing the records in parquet format 
(ParquetUtils.serializeRecordsToLogBlock).  We need to revisit this part of 
logic to see if they should be configurable.  (was: For serializing log records 
to Parquet log blocks, there are hardcoded config values for writing the 
records in parquet format (serializeRecordsToLogBlock))

> Revisit the configs in ParquetUtils.serializeRecordsToLogBlock
> --
>
> Key: HUDI-7755
> URL: https://issues.apache.org/jira/browse/HUDI-7755
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Priority: Major
> Fix For: 1.1.0
>
>
> For serializing log records to Parquet log blocks, there are hardcoded config 
> values for writing the records in parquet format 
> (ParquetUtils.serializeRecordsToLogBlock).  We need to revisit this part of 
> logic to see if they should be configurable.





[jira] [Updated] (HUDI-7755) Revisit the configs in ParquetUtils.serializeRecordsToLogBlock

2024-05-13 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7755:

Fix Version/s: 1.1.0

> Revisit the configs in ParquetUtils.serializeRecordsToLogBlock
> --
>
> Key: HUDI-7755
> URL: https://issues.apache.org/jira/browse/HUDI-7755
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Priority: Major
> Fix For: 1.1.0
>
>






[jira] [Created] (HUDI-7755) Revisit the configs in ParquetUtils.serializeRecordsToLogBlock

2024-05-13 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-7755:
---

 Summary: Revisit the configs in 
ParquetUtils.serializeRecordsToLogBlock
 Key: HUDI-7755
 URL: https://issues.apache.org/jira/browse/HUDI-7755
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Ethan Guo








Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #11210:
URL: https://github.com/apache/hudi/pull/11210#issuecomment-2109237024

   
   ## CI report:
   
   * 4e922d2b74ebcf25fe8795aa15e2a99c1e082fe2 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23892)
 
   * 1e7ab5d044f35d65670bb0fc442721e01a677d8d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23896)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-13 Thread via GitHub


yihua commented on code in PR #11210:
URL: https://github.com/apache/hudi/pull/11210#discussion_r1599346193


##
hudi-hadoop-common/src/main/java/org/apache/hudi/common/util/HFileUtils.java:
##
@@ -128,4 +165,89 @@ public HoodieFileFormat getFormat() {
   public void writeMetaFile(HoodieStorage storage, StoragePath filePath, 
Properties props) throws IOException {
 throw new UnsupportedOperationException("HFileUtils does not support 
writeMetaFile");
   }
+
+  @Override
+  public byte[] serializeRecordsToLogBlock(StorageConfiguration<?> storageConf,
+   List<HoodieRecord> records,
+   Schema writerSchema,
+   Schema readerSchema,
+   String keyFieldName,
+   Map<String, String> paramsMap) throws IOException {
+Compression.Algorithm compressionAlgorithm = 
getHFileCompressionAlgorithm(paramsMap);
+HFileContext context = new HFileContextBuilder()
+.withBlockSize(DEFAULT_BLOCK_SIZE_FOR_LOG_FILE)
+.withCompression(compressionAlgorithm)
+.withCellComparator(new HoodieHBaseKVComparator())
+.build();
+
+Configuration conf = storageConf.unwrapAs(Configuration.class);
+CacheConfig cacheConfig = new CacheConfig(conf);
+ByteArrayOutputStream baos = new ByteArrayOutputStream();
+FSDataOutputStream ostream = new FSDataOutputStream(baos, null);
+
+// Use simple incrementing counter as a key
+boolean useIntegerKey = !getRecordKey(records.get(0), readerSchema, 
keyFieldName).isPresent();
+// This is set here to avoid re-computing this in the loop
+int keyWidth = useIntegerKey ? (int) Math.ceil(Math.log(records.size())) + 
1 : -1;
+
+// Serialize records into bytes
+Map<String, List<byte[]>> sortedRecordsMap = new TreeMap<>();
+
+Iterator<HoodieRecord> itr = records.iterator();
+int id = 0;
+while (itr.hasNext()) {
+  HoodieRecord record = itr.next();
+  String recordKey;
+  if (useIntegerKey) {
+recordKey = String.format("%" + keyWidth + "s", id++);
+  } else {
+recordKey = getRecordKey(record, readerSchema, keyFieldName).get();
+  }
+
+  final byte[] recordBytes = serializeRecord(record, writerSchema, 
keyFieldName);
+  // If key exists in the map, append to its list. If not, create a new 
list.
+  // Get the existing list of recordBytes for the recordKey, or an empty 
list if it doesn't exist
+  List<byte[]> recordBytesList = sortedRecordsMap.getOrDefault(recordKey, new ArrayList<>());
+  recordBytesList.add(recordBytes);
+  // Put the updated list back into the map
+  sortedRecordsMap.put(recordKey, recordBytesList);
+}
+
+HFile.Writer writer = HFile.getWriterFactory(conf, cacheConfig)
+.withOutputStream(ostream).withFileContext(context).create();
+
+// Write the records
+sortedRecordsMap.forEach((recordKey, recordBytesList) -> {
+  for (byte[] recordBytes : recordBytesList) {
+try {
+  KeyValue kv = new KeyValue(recordKey.getBytes(), null, null, 
recordBytes);
+  writer.append(kv);
+} catch (IOException e) {
+  throw new HoodieIOException("IOException serializing records", e);
+}
+  }
+});
+
+writer.appendFileInfo(
+getUTF8Bytes(HoodieAvroHFileReaderImplBase.SCHEMA_KEY), 
getUTF8Bytes(readerSchema.toString()));
+
+writer.close();
+ostream.flush();
+ostream.close();
+
+return baos.toByteArray();
+  }
+
+  private Option<String> getRecordKey(HoodieRecord record, Schema readerSchema, String keyFieldName) {
+return Option.ofNullable(record.getRecordKey(readerSchema, keyFieldName));
+  }
+
+  private byte[] serializeRecord(HoodieRecord record, Schema schema, String 
keyFieldName) throws IOException {

Review Comment:
   Fixed.
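
For context on the integer-key path in the diff above: right-aligned, space-padded keys keep lexicographic (HFile) order consistent with numeric order, since a space sorts before every digit. The diff sizes the width with `Math.ceil(Math.log(n)) + 1`, where `Math.log` is the natural log; that overestimates the decimal digit count, which is harmless for ordering as long as one width is used for the whole block. A sketch that computes the exact width:

```java
public class IntegerKeySketch {
    // Fixed-width, space-padded key: "  9" sorts before " 10".
    static String paddedKey(int id, int keyWidth) {
        return String.format("%" + keyWidth + "s", id);
    }

    // Exact decimal width of the largest generated id (recordCount - 1).
    static int widthFor(int recordCount) {
        return String.valueOf(Math.max(recordCount - 1, 0)).length();
    }

    public static void main(String[] args) {
        int w = widthFor(1000); // ids 0..999 -> width 3
        System.out.println(paddedKey(9, w).compareTo(paddedKey(10, w)) < 0); // prints true
    }
}
```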






Re: [PR] [HUDI-7617] Fix issues for bulk insert user defined partitioner in StreamSync [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #11014:
URL: https://github.com/apache/hudi/pull/11014#issuecomment-2109236780

   
   ## CI report:
   
   * ca6231f4648a3cfe9e1a14aa76987d2f26a69919 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23237)
 
   * 1ee0f29f2bd4b02aeb3370d864cbdae946be809e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23895)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-13 Thread via GitHub


yihua commented on code in PR #11210:
URL: https://github.com/apache/hudi/pull/11210#discussion_r1599346082


##
hudi-hadoop-common/src/main/java/org/apache/hudi/common/util/HFileUtils.java:
##
@@ -128,4 +165,89 @@ public HoodieFileFormat getFormat() {
   public void writeMetaFile(HoodieStorage storage, StoragePath filePath, 
Properties props) throws IOException {
 throw new UnsupportedOperationException("HFileUtils does not support 
writeMetaFile");
   }
+
+  @Override
+  public byte[] serializeRecordsToLogBlock(StorageConfiguration<?> storageConf,
+   List<HoodieRecord> records,
+   Schema writerSchema,
+   Schema readerSchema,
+   String keyFieldName,
+   Map<String, String> paramsMap) throws IOException {
+Compression.Algorithm compressionAlgorithm = 
getHFileCompressionAlgorithm(paramsMap);
+HFileContext context = new HFileContextBuilder()
+.withBlockSize(DEFAULT_BLOCK_SIZE_FOR_LOG_FILE)
+.withCompression(compressionAlgorithm)
+.withCellComparator(new HoodieHBaseKVComparator())
+.build();
+
+Configuration conf = storageConf.unwrapAs(Configuration.class);
+CacheConfig cacheConfig = new CacheConfig(conf);
+ByteArrayOutputStream baos = new ByteArrayOutputStream();
+FSDataOutputStream ostream = new FSDataOutputStream(baos, null);
+
+// Use simple incrementing counter as a key
+boolean useIntegerKey = !getRecordKey(records.get(0), readerSchema, 
keyFieldName).isPresent();
+// This is set here to avoid re-computing this in the loop
+int keyWidth = useIntegerKey ? (int) Math.ceil(Math.log(records.size())) + 
1 : -1;
+
+// Serialize records into bytes
+    Map<String, List<byte[]>> sortedRecordsMap = new TreeMap<>();
+
+    Iterator<HoodieRecord> itr = records.iterator();
+int id = 0;
+while (itr.hasNext()) {
+  HoodieRecord record = itr.next();
+  String recordKey;
+  if (useIntegerKey) {
+recordKey = String.format("%" + keyWidth + "s", id++);
+  } else {
+recordKey = getRecordKey(record, readerSchema, keyFieldName).get();
+  }
+
+  final byte[] recordBytes = serializeRecord(record, writerSchema, 
keyFieldName);
+  // If key exists in the map, append to its list. If not, create a new 
list.
+  // Get the existing list of recordBytes for the recordKey, or an empty 
list if it doesn't exist
+      List<byte[]> recordBytesList = sortedRecordsMap.getOrDefault(recordKey, new ArrayList<>());
+  recordBytesList.add(recordBytes);
+  // Put the updated list back into the map
+  sortedRecordsMap.put(recordKey, recordBytesList);
+}
+
+HFile.Writer writer = HFile.getWriterFactory(conf, cacheConfig)
+.withOutputStream(ostream).withFileContext(context).create();
+
+// Write the records
+sortedRecordsMap.forEach((recordKey, recordBytesList) -> {
+  for (byte[] recordBytes : recordBytesList) {
+try {
+  KeyValue kv = new KeyValue(recordKey.getBytes(), null, null, 
recordBytes);
+  writer.append(kv);
+} catch (IOException e) {
+  throw new HoodieIOException("IOException serializing records", e);
+}
+  }
+});
+
+writer.appendFileInfo(
+getUTF8Bytes(HoodieAvroHFileReaderImplBase.SCHEMA_KEY), 
getUTF8Bytes(readerSchema.toString()));
+
+writer.close();
+ostream.flush();
+ostream.close();
+
+return baos.toByteArray();
+  }
+
+  private Option<String> getRecordKey(HoodieRecord record, Schema readerSchema, String keyFieldName) {

Review Comment:
   Fixed.






Re: [PR] [HUDI-7549] Reverting spurious log block deduction with LogRecordReader [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #10922:
URL: https://github.com/apache/hudi/pull/10922#issuecomment-210923

   
   ## CI report:
   
   * 1c36f92dbff0e9be085a409d28cb9403a0343781 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23866)
 
   * 802eeb74510ddbeb5cd952d9192aaf623d9c7ee9 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23894)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-13 Thread via GitHub


yihua commented on code in PR #11210:
URL: https://github.com/apache/hudi/pull/11210#discussion_r1599345810


##
hudi-hadoop-common/src/main/java/org/apache/hudi/common/util/HFileUtils.java:
##
@@ -128,4 +165,89 @@ public HoodieFileFormat getFormat() {
   public void writeMetaFile(HoodieStorage storage, StoragePath filePath, 
Properties props) throws IOException {
 throw new UnsupportedOperationException("HFileUtils does not support 
writeMetaFile");
   }
+
+  @Override
+  public byte[] serializeRecordsToLogBlock(StorageConfiguration storageConf,
+                                           List<HoodieRecord> records,
+                                           Schema writerSchema,
+                                           Schema readerSchema,
+                                           String keyFieldName,
+                                           Map<String, String> paramsMap) throws IOException {
+Compression.Algorithm compressionAlgorithm = 
getHFileCompressionAlgorithm(paramsMap);
+HFileContext context = new HFileContextBuilder()
+.withBlockSize(DEFAULT_BLOCK_SIZE_FOR_LOG_FILE)
+.withCompression(compressionAlgorithm)
+.withCellComparator(new HoodieHBaseKVComparator())
+.build();
+
+Configuration conf = storageConf.unwrapAs(Configuration.class);
+CacheConfig cacheConfig = new CacheConfig(conf);
+ByteArrayOutputStream baos = new ByteArrayOutputStream();
+FSDataOutputStream ostream = new FSDataOutputStream(baos, null);
+
+// Use simple incrementing counter as a key
+boolean useIntegerKey = !getRecordKey(records.get(0), readerSchema, 
keyFieldName).isPresent();
+// This is set here to avoid re-computing this in the loop
+int keyWidth = useIntegerKey ? (int) Math.ceil(Math.log(records.size())) + 
1 : -1;
+
+// Serialize records into bytes
+    Map<String, List<byte[]>> sortedRecordsMap = new TreeMap<>();
+
+    Iterator<HoodieRecord> itr = records.iterator();
+int id = 0;
+while (itr.hasNext()) {
+  HoodieRecord record = itr.next();
+  String recordKey;
+  if (useIntegerKey) {
+recordKey = String.format("%" + keyWidth + "s", id++);
+  } else {
+recordKey = getRecordKey(record, readerSchema, keyFieldName).get();
+  }
+
+  final byte[] recordBytes = serializeRecord(record, writerSchema, 
keyFieldName);
+  // If key exists in the map, append to its list. If not, create a new 
list.
+  // Get the existing list of recordBytes for the recordKey, or an empty 
list if it doesn't exist
+      List<byte[]> recordBytesList = sortedRecordsMap.getOrDefault(recordKey, new ArrayList<>());
+  recordBytesList.add(recordBytes);
+  // Put the updated list back into the map
+  sortedRecordsMap.put(recordKey, recordBytesList);
+}
+
+HFile.Writer writer = HFile.getWriterFactory(conf, cacheConfig)
+.withOutputStream(ostream).withFileContext(context).create();
+
+// Write the records
+sortedRecordsMap.forEach((recordKey, recordBytesList) -> {
+  for (byte[] recordBytes : recordBytesList) {
+try {
+  KeyValue kv = new KeyValue(recordKey.getBytes(), null, null, 
recordBytes);
+  writer.append(kv);
+} catch (IOException e) {
+  throw new HoodieIOException("IOException serializing records", e);
+}
+  }
+});
+
+writer.appendFileInfo(
+getUTF8Bytes(HoodieAvroHFileReaderImplBase.SCHEMA_KEY), 
getUTF8Bytes(readerSchema.toString()));
+
+writer.close();
+ostream.flush();

Review Comment:
   This flushes the data to the `ByteArrayOutputStream` after the writer is done, and `writer.close()` already flushes the data internally.  This PR only moves this part of the code from `HoodieHFileDataBlock` to the `HFileUtils` class. 
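   The flush-on-close behavior referenced here is standard `java.io` semantics: closing a buffered stream flushes its buffer into the underlying stream first. A minimal stand-alone sketch (using `BufferedOutputStream` over a `ByteArrayOutputStream` as a toy stand-in for the HFile writer; this is not Hudi code):

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

public class CloseFlushesDemo {
  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    BufferedOutputStream buffered = new BufferedOutputStream(baos);

    buffered.write(new byte[] {1, 2, 3, 4, 5});
    // The 5 bytes are still in the 8 KB buffer, not yet in baos.
    System.out.println("before-close=" + baos.size());

    buffered.close();  // close() flushes the buffer before closing the stream
    System.out.println("after-close=" + baos.size());

    if (baos.size() != 5) {
      throw new AssertionError("expected close() to flush all 5 bytes");
    }
  }
}
```

   So a `flush()` after `close()` on a byte-array-backed stream is harmless but redundant.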






Re: [I] [SUPPORT]hoodie.datasource.read.file.index.listing.mode is always eager [hudi]

2024-05-13 Thread via GitHub


danny0405 commented on issue #11201:
URL: https://github.com/apache/hudi/issues/11201#issuecomment-2109232721

   > It seems that this issue has been fixed in version 0.14.1
   
   yeah, you got it.





Re: [PR] [HUDI-7549] Reverting spurious log block deduction with LogRecordReader [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #10922:
URL: https://github.com/apache/hudi/pull/10922#issuecomment-2109230347

   
   ## CI report:
   
   * 1c36f92dbff0e9be085a409d28cb9403a0343781 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23866)
 
   * 802eeb74510ddbeb5cd952d9192aaf623d9c7ee9 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #11210:
URL: https://github.com/apache/hudi/pull/11210#issuecomment-2109230874

   
   ## CI report:
   
   * 4e922d2b74ebcf25fe8795aa15e2a99c1e082fe2 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23892)
 
   * 1e7ab5d044f35d65670bb0fc442721e01a677d8d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7617] Fix issues for bulk insert user defined partitioner in StreamSync [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #11014:
URL: https://github.com/apache/hudi/pull/11014#issuecomment-2109230509

   
   ## CI report:
   
   * ca6231f4648a3cfe9e1a14aa76987d2f26a69919 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23237)
 
   * 1ee0f29f2bd4b02aeb3370d864cbdae946be809e UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-13 Thread via GitHub


yihua commented on code in PR #11210:
URL: https://github.com/apache/hudi/pull/11210#discussion_r1599341130


##
hudi-hadoop-common/src/main/java/org/apache/hudi/common/util/HFileUtils.java:
##
@@ -35,21 +39,54 @@
 
 import org.apache.avro.Schema;
 import org.apache.avro.generic.GenericRecord;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.hbase.KeyValue;
+import org.apache.hadoop.hbase.io.compress.Compression;
+import org.apache.hadoop.hbase.io.hfile.CacheConfig;
+import org.apache.hadoop.hbase.io.hfile.HFile;
+import org.apache.hadoop.hbase.io.hfile.HFileContext;
+import org.apache.hadoop.hbase.io.hfile.HFileContextBuilder;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
+import java.io.ByteArrayOutputStream;
 import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Iterator;
 import java.util.List;
 import java.util.Map;
 import java.util.Properties;
 import java.util.Set;
+import java.util.TreeMap;
+
+import static 
org.apache.hudi.common.table.log.block.HoodieHFileDataBlock.HFILE_COMPRESSION_ALGO_PARAM_KEY;
+import static org.apache.hudi.common.util.StringUtils.getUTF8Bytes;
 
 /**
  * Utility functions for HFile files.
  */
-public class HFileUtils extends BaseFileUtils {
-
+public class HFileUtils extends FileFormatUtils {
   private static final Logger LOG = LoggerFactory.getLogger(HFileUtils.class);
+  private static final int DEFAULT_BLOCK_SIZE_FOR_LOG_FILE = 1024 * 1024;
+
+  /**
+   * Gets the {@link Compression.Algorithm} Enum based on the {@link 
CompressionCodec} name.
+   *
+   * @param paramsMap parameter map containing the compression codec config.
+   * @return the {@link Compression.Algorithm} Enum.
+   */
+  public static Compression.Algorithm getHFileCompressionAlgorithm(Map<String, String> paramsMap) {
+String algoName = paramsMap.get(HFILE_COMPRESSION_ALGO_PARAM_KEY);
+if (algoName == null) {

Review Comment:
   Fixed.  A new test is added.






Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-13 Thread via GitHub


yihua commented on PR #11210:
URL: https://github.com/apache/hudi/pull/11210#issuecomment-2109225533

   > also not a fan of the `org.apache.hudi.io.compress.` package name. But 
probably too late to change now
   
   Since the compression logic is also under the scope of IO, we put it under this package name.





Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #11210:
URL: https://github.com/apache/hudi/pull/11210#issuecomment-2109224269

   
   ## CI report:
   
   * 4e922d2b74ebcf25fe8795aa15e2a99c1e082fe2 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23892)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [I] [SUPPORT]FileID of partition path xxx=xx does not exist. [hudi]

2024-05-13 Thread via GitHub


danny0405 commented on issue #11202:
URL: https://github.com/apache/hudi/issues/11202#issuecomment-2109224081

   ```java
   Caused by: java.util.NoSuchElementException: FileID x of partition path 
dt=2019-02-20 does not exist.
   at 
org.apache.hudi.io.HoodieMergeHandle.getLatestBaseFile(HoodieMergeHandle.java:159)
   at org.apache.hudi.io.HoodieMergeHandle.<init>(HoodieMergeHandle.java:121)
   at org.apache.hudi.io.FlinkMergeHandle.<init>(FlinkMergeHandle.java:70)
   at org.apache.hudi.io.FlinkConcatHandle.<init>(FlinkConcatHandle.java:53)
   at 
org.apache.hudi.client.HoodieFlinkWriteClient.getOrCreateWriteHandle(HoodieFlinkWriteClient.java:557)
   at 
org.apache.hudi.client.HoodieFlinkWriteClient.insert(HoodieFlinkWriteClient.java:175)
   at 
org.apache.hudi.sink.StreamWriteFunction.lambda$initWriteFunction$0(StreamWriteFunction.java:181)
   at 
org.apache.hudi.sink.StreamWriteFunction.lambda$flushRemaining$7(StreamWriteFunction.java:461)
   ```
   
   The error message indicates that you enabled inline clustering for Flink. Can you disable that and try again using async clustering instead?
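   For reference, a hedged sketch of what the suggested Flink SQL configuration might look like. The option names (`clustering.schedule.enabled`, `clustering.async.enabled`, `clustering.delta_commits`) are taken from recent Hudi `FlinkOptions`; verify them against your Hudi version, and the column list here is elided on purpose:

```sql
CREATE TABLE hudi_sink (
  -- ... your columns ...
) WITH (
  'connector' = 'hudi',
  'table.type' = 'COPY_ON_WRITE',
  -- schedule clustering plans from the writer, but execute them asynchronously
  'clustering.schedule.enabled' = 'true',
  'clustering.async.enabled' = 'true',
  'clustering.delta_commits' = '4'
);
```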





Re: [PR] [HUDI-7624] Fixing index tagging duration [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #11035:
URL: https://github.com/apache/hudi/pull/11035#issuecomment-2109223951

   
   ## CI report:
   
   * 074845c216002fc00c28dcbb7720ffc05bdc7e8f Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23891)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-13 Thread via GitHub


jonvex commented on code in PR #11210:
URL: https://github.com/apache/hudi/pull/11210#discussion_r1599326562


##
hudi-hadoop-common/src/main/java/org/apache/hudi/common/util/HFileUtils.java:
##
@@ -128,4 +165,89 @@ public HoodieFileFormat getFormat() {
   public void writeMetaFile(HoodieStorage storage, StoragePath filePath, 
Properties props) throws IOException {
 throw new UnsupportedOperationException("HFileUtils does not support 
writeMetaFile");
   }
+
+  @Override
+  public byte[] serializeRecordsToLogBlock(StorageConfiguration storageConf,
+                                           List<HoodieRecord> records,
+                                           Schema writerSchema,
+                                           Schema readerSchema,
+                                           String keyFieldName,
+                                           Map<String, String> paramsMap) throws IOException {
+Compression.Algorithm compressionAlgorithm = 
getHFileCompressionAlgorithm(paramsMap);
+HFileContext context = new HFileContextBuilder()
+.withBlockSize(DEFAULT_BLOCK_SIZE_FOR_LOG_FILE)
+.withCompression(compressionAlgorithm)
+.withCellComparator(new HoodieHBaseKVComparator())
+.build();
+
+Configuration conf = storageConf.unwrapAs(Configuration.class);
+CacheConfig cacheConfig = new CacheConfig(conf);
+ByteArrayOutputStream baos = new ByteArrayOutputStream();
+FSDataOutputStream ostream = new FSDataOutputStream(baos, null);
+
+// Use simple incrementing counter as a key
+boolean useIntegerKey = !getRecordKey(records.get(0), readerSchema, 
keyFieldName).isPresent();
+// This is set here to avoid re-computing this in the loop
+int keyWidth = useIntegerKey ? (int) Math.ceil(Math.log(records.size())) + 
1 : -1;
+
+// Serialize records into bytes
+    Map<String, List<byte[]>> sortedRecordsMap = new TreeMap<>();
+
+    Iterator<HoodieRecord> itr = records.iterator();
+int id = 0;
+while (itr.hasNext()) {
+  HoodieRecord record = itr.next();
+  String recordKey;
+  if (useIntegerKey) {
+recordKey = String.format("%" + keyWidth + "s", id++);
+  } else {
+recordKey = getRecordKey(record, readerSchema, keyFieldName).get();
+  }
+
+  final byte[] recordBytes = serializeRecord(record, writerSchema, 
keyFieldName);
+  // If key exists in the map, append to its list. If not, create a new 
list.
+  // Get the existing list of recordBytes for the recordKey, or an empty 
list if it doesn't exist
+      List<byte[]> recordBytesList = sortedRecordsMap.getOrDefault(recordKey, new ArrayList<>());
+  recordBytesList.add(recordBytes);
+  // Put the updated list back into the map
+  sortedRecordsMap.put(recordKey, recordBytesList);
+}
+
+HFile.Writer writer = HFile.getWriterFactory(conf, cacheConfig)
+.withOutputStream(ostream).withFileContext(context).create();
+
+// Write the records
+sortedRecordsMap.forEach((recordKey, recordBytesList) -> {
+  for (byte[] recordBytes : recordBytesList) {
+try {
+  KeyValue kv = new KeyValue(recordKey.getBytes(), null, null, 
recordBytes);
+  writer.append(kv);
+} catch (IOException e) {
+  throw new HoodieIOException("IOException serializing records", e);
+}
+  }
+});
+
+writer.appendFileInfo(
+getUTF8Bytes(HoodieAvroHFileReaderImplBase.SCHEMA_KEY), 
getUTF8Bytes(readerSchema.toString()));
+
+writer.close();
+ostream.flush();

Review Comment:
   Wouldn't we want to flush before closing the writer?



##
hudi-hadoop-common/src/main/java/org/apache/hudi/common/util/HFileUtils.java:
##
@@ -35,21 +39,54 @@
 
 import org.apache.avro.Schema;
 import org.apache.avro.generic.GenericRecord;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.hbase.KeyValue;
+import org.apache.hadoop.hbase.io.compress.Compression;
+import org.apache.hadoop.hbase.io.hfile.CacheConfig;
+import org.apache.hadoop.hbase.io.hfile.HFile;
+import org.apache.hadoop.hbase.io.hfile.HFileContext;
+import org.apache.hadoop.hbase.io.hfile.HFileContextBuilder;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
+import java.io.ByteArrayOutputStream;
 import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Iterator;
 import java.util.List;
 import java.util.Map;
 import java.util.Properties;
 import java.util.Set;
+import java.util.TreeMap;
+
+import static 
org.apache.hudi.common.table.log.block.HoodieHFileDataBlock.HFILE_COMPRESSION_ALGO_PARAM_KEY;
+import static org.apache.hudi.common.util.StringUtils.getUTF8Bytes;
 
 /**
  * Utility functions for HFile files.
  */
-public class HFileUtils extends BaseFileUtils {
-
+public class HFileUtils extends FileFormatUtils {
   private static final Logger LOG = LoggerFactory.getLogger(HFileUtils.class);
+  private static final int DEFAULT_BLOCK_SIZE_FOR_LOG_FILE = 1024 * 1024;
+
+  

Re: [I] [SUPPORT]hudi way of doing bucket index cannot be used to improve query engines queries such join and filter [hudi]

2024-05-13 Thread via GitHub


danny0405 commented on issue #11204:
URL: https://github.com/apache/hudi/issues/11204#issuecomment-2109221161

   > So if we have to choose one between spark and hive, I think spark might be 
of higher priority
   
   I agree, do you have energy to complete that suspended PR.





Re: [PR] [HUDI-7589] Add API to create HoodieStorage in HoodieIOFactory [hudi]

2024-05-13 Thread via GitHub


yihua merged PR #11208:
URL: https://github.com/apache/hudi/pull/11208





Re: [PR] [HUDI-7589] Add API to create HoodieStorage in HoodieIOFactory [hudi]

2024-05-13 Thread via GitHub


yihua commented on code in PR #11208:
URL: https://github.com/apache/hudi/pull/11208#discussion_r1599331946


##
hudi-hadoop-common/src/main/java/org/apache/hudi/storage/hadoop/HoodieHadoopStorage.java:
##
@@ -58,27 +58,10 @@
 public class HoodieHadoopStorage extends HoodieStorage {
   private final FileSystem fs;
 
-  public HoodieHadoopStorage(HoodieStorage storage) {

Review Comment:
   Makes sense.






Re: [PR] [HUDI-7589] Add API to create HoodieStorage in HoodieIOFactory [hudi]

2024-05-13 Thread via GitHub


yihua commented on code in PR #11208:
URL: https://github.com/apache/hudi/pull/11208#discussion_r1599330264


##
hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieIOFactory.java:
##
@@ -48,4 +48,13 @@ private static HoodieIOFactory getIOFactory(String 
ioFactoryClass) {
 
   public abstract HoodieFileWriterFactory 
getWriterFactory(HoodieRecord.HoodieRecordType recordType);
 
+  public abstract HoodieStorage getStorage(StoragePath storagePath);
+
+  public abstract HoodieStorage getStorage(StoragePath path,

Review Comment:
   OK. We take this on separately.






Re: [PR] [HUDI-7589] Add API to create HoodieStorage in HoodieIOFactory [hudi]

2024-05-13 Thread via GitHub


yihua commented on code in PR #11208:
URL: https://github.com/apache/hudi/pull/11208#discussion_r1599329583


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java:
##
@@ -470,9 +470,9 @@ public void performMergeDataValidationCheck(WriteStatus 
writeStatus) {
 }
 
 long oldNumWrites = 0;
-try (HoodieFileReader reader = 
HoodieIOFactory.getIOFactory(storage.getConf())
+try (HoodieFileReader reader = 
HoodieIOFactory.getIOFactory(hoodieTable.getStorageConf())

Review Comment:
   Sg






Re: [PR] [HUDI-7549] Reverting spurious log block deduction with LogRecordReader [hudi]

2024-05-13 Thread via GitHub


yihua commented on code in PR #10922:
URL: https://github.com/apache/hudi/pull/10922#discussion_r1599328354


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java:
##
@@ -654,22 +640,6 @@ private HoodieLogBlock.HoodieLogBlockType 
pickLogDataBlockFormat() {
 }
   }
 
-  private static Map 
getUpdatedHeader(Map header, int 
blockSequenceNumber, long attemptNumber,
-  
HoodieWriteConfig config, boolean addBlockIdentifier) {
-Map updatedHeader = new HashMap<>(header);
-if (addBlockIdentifier && 
!HoodieTableMetadata.isMetadataTable(config.getBasePath())) { // add block 
sequence numbers only for data table.
-  updatedHeader.put(HeaderMetadataType.BLOCK_IDENTIFIER, attemptNumber + 
"," + blockSequenceNumber);
-}
-if (config.shouldWritePartialUpdates()) {

Review Comment:
   I fixed it.






Re: [I] [SUPPORT]hoodie.datasource.read.file.index.listing.mode is always eager [hudi]

2024-05-13 Thread via GitHub


CaesarWangX commented on issue #11201:
URL: https://github.com/apache/hudi/issues/11201#issuecomment-2109205844

   It seems that this issue has been fixed in version 0.14.1 





Re: [I] [SUPPORT]FileID of partition path xxx=xx does not exist. [hudi]

2024-05-13 Thread via GitHub


CaesarWangX commented on issue #11202:
URL: https://github.com/apache/hudi/issues/11202#issuecomment-2109198672

   The reason we do not use the metadata table is that in Spark Structured Streaming, enabling the metadata table affects micro-batch efficiency, as there are additional list operations.
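   For context, the writer-side switch being described is a single config key (from Hudi's metadata config; shown here as a sketch, so check the key name and default against your Hudi release):

```properties
# Disable the Hudi metadata table on the writer (default is true in recent releases)
hoodie.metadata.enable=false
```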





Re: [PR] [HUDI-7624] Fixing index tagging duration [hudi]

2024-05-13 Thread via GitHub


danny0405 commented on code in PR #11035:
URL: https://github.com/apache/hudi/pull/11035#discussion_r1599312674


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/BaseCommitActionExecutor.java:
##
@@ -112,6 +113,10 @@ public BaseCommitActionExecutor(HoodieEngineContext 
context, HoodieWriteConfig c
 
   public abstract HoodieWriteMetadata<O> execute(I inputRecords);
 
+  public HoodieWriteMetadata execute(I inputRecords, Option 
sourceReadAndIndexTimer) {
+return this.execute(inputRecords);

Review Comment:
   Not sure why we need a new `#execute` interface. I see that all the impl executors initialize the timer on the fly while invoking this method, so why not just initialize the timer in `#execute` itself?



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/HoodieWriteMetadata.java:
##
@@ -34,6 +34,7 @@ public class HoodieWriteMetadata<O> {
 
   private O writeStatuses;
   private Option indexLookupDuration = Option.empty();

Review Comment:
   Should we remove this?



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/BaseWriteHelper.java:
##
@@ -46,22 +47,31 @@ public HoodieWriteMetadata<O> write(String instantTime,
   int configuredShuffleParallelism,
   BaseCommitActionExecutor 
executor,
   WriteOperationType operationType) {
+return this.write(instantTime, inputRecords, context, table, 
shouldCombine, configuredShuffleParallelism, executor, operationType, 
Option.empty());
+  }
+
+  public HoodieWriteMetadata<O> write(String instantTime,
+  I inputRecords,
+  HoodieEngineContext context,
+  HoodieTable table,
+  boolean shouldCombine,
+  int configuredShuffleParallelism,
+  BaseCommitActionExecutor 
executor,
+  WriteOperationType operationType,
+  Option 
sourceReadAndIndexTimer) {
 try {
   // De-dupe/merge if needed
   I dedupedRecords =
   combineOnCondition(shouldCombine, inputRecords, 
configuredShuffleParallelism, table);
 
-  Instant lookupBegin = Instant.now();
   I taggedRecords = dedupedRecords;

Review Comment:
   Same question: why not just initialize the timer here so that we can avoid introducing a new method?



##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/SparkRDDWriteClient.java:
##
@@ -141,8 +141,8 @@ public JavaRDD upsert(JavaRDD> 
records, String inst
 preWrite(instantTime, WriteOperationType.UPSERT, table.getMetaClient());
 HoodieWriteMetadata<HoodieData<WriteStatus>> result = table.upsert(context, instantTime, HoodieJavaRDD.of(records));
 HoodieWriteMetadata<JavaRDD<WriteStatus>> resultRDD = result.clone(HoodieJavaRDD.getJavaRDD(result.getWriteStatuses()));
-if (result.getIndexLookupDuration().isPresent()) {
-  metrics.updateIndexMetrics(LOOKUP_STR, 
result.getIndexLookupDuration().get().toMillis());
+if (result.getSourceReadAndIndexDurationMs().isPresent()) {
+  metrics.updateSourceReadAndIndexMetrics(LOOKUP_STR, 
result.getSourceReadAndIndexDurationMs().get());

Review Comment:
   Should we still use `LOOKUP_STR` here?






Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #11210:
URL: https://github.com/apache/hudi/pull/11210#issuecomment-2109186737

   
   ## CI report:
   
   * 4e922d2b74ebcf25fe8795aa15e2a99c1e082fe2 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7752) Abstract serializeRecords for log writing

2024-05-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7752:
-
Labels: hoodie-storage pull-request-available  (was: hoodie-storage)

> Abstract serializeRecords for log writing
> -
>
> Key: HUDI-7752
> URL: https://issues.apache.org/jira/browse/HUDI-7752
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: hoodie-storage, pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-13 Thread via GitHub


yihua opened a new pull request, #11210:
URL: https://github.com/apache/hudi/pull/11210

   ### Change Logs
   
   This PR adds a new API `serializeRecordsToLogBlock` to the `FileFormatUtils` 
class (renamed from `BaseFileUtils`), to abstract the `serializeRecords` logic 
in `HoodieParquetDataBlock` and `HoodieHFileDataBlock`.
   
   ### Impact
   
   Moves Hadoop-dependent logic of serializing Hudi records to log block 
content to the `hudi-hadoop-common` module.
   
   ### Risk level
   
   none
   
   ### Documentation Update
   
   none
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7589] Add API to create HoodieStorage in HoodieIOFactory [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #11208:
URL: https://github.com/apache/hudi/pull/11208#issuecomment-2109174502

   
   ## CI report:
   
   * ccdd43869f325014e3be41bc41f8c510080dc549 UNKNOWN
   * 153de43462c5b4ac9762cb87e4ded68640995058 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23888)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6563]Supports flink lookup join [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #9228:
URL: https://github.com/apache/hudi/pull/9228#issuecomment-2109172992

   
   ## CI report:
   
   * 8d29905fdeba6e5b81bdae7b0cdd1166511b1a1a Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23889)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT]hudi way of doing bucket index cannot be used to improve query engines queries such join and filter [hudi]

2024-05-13 Thread via GitHub


ziudu commented on issue #11204:
URL: https://github.com/apache/hudi/issues/11204#issuecomment-2109160408

   Hi Danny0405,
   
   I think supporting a Spark sort-merge join between two Hudi tables with 
bucket optimization is an important feature.
   
   Currently, if we join two Hudi tables, the bucket index's bucket 
information is not usable by Spark, so a shuffle is always needed. As 
explained in [8657](https://github.com/apache/hudi/pull/8657), the hashing, 
file naming, file numbering, and file sorting are all different.
   
   Unfortunately, according to 
https://issues.apache.org/jira/browse/SPARK-19256, Spark buckets are not yet 
compatible with Hive buckets. So if we have to choose between Spark and Hive, 
I think Spark might be the higher priority.
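   To illustrate why the hashing mismatch forces a shuffle: if the writer and 
the query engine assign buckets with different hash functions, the engine 
cannot assume the tables are co-partitioned. A toy, self-contained sketch 
(both hash functions below are illustrative, not the real Hudi bucket-index 
hash or Spark's Murmur3 bucketing):

```java
// Illustrative only: two partitioners that disagree on bucket assignment.
// Neither is the real Hudi bucket-index hash or Spark's bucketing hash.
public class BucketMismatchSketch {
  static final int NUM_BUCKETS = 4;

  // "Writer-side" bucketing, e.g. plain hashCode-based.
  static int writerBucket(String key) {
    return Math.floorMod(key.hashCode(), NUM_BUCKETS);
  }

  // "Engine-side" bucketing with a different (shifted) hash.
  static int engineBucket(String key) {
    return Math.floorMod(key.hashCode() * 31 + 7, NUM_BUCKETS);
  }

  public static void main(String[] args) {
    String[] keys = {"uuid-1", "uuid-2", "uuid-3", "uuid-4", "uuid-5"};
    boolean anyMismatch = false;
    for (String key : keys) {
      if (writerBucket(key) != engineBucket(key)) {
        anyMismatch = true;
      }
    }
    // If bucket assignments disagree for any key, the engine must
    // repartition (shuffle) before partitions can be joined pairwise.
    System.out.println("shuffleRequired=" + anyMismatch);
  }
}
```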
  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT]FileID of partition path xxx=xx does not exist. [hudi]

2024-05-13 Thread via GitHub


CaesarWangX commented on issue #11202:
URL: https://github.com/apache/hudi/issues/11202#issuecomment-2109155774

   Hi @danny0405 @xushiyan, we are using Spark 3.4.1 and Hudi 0.14.0. I have 
updated the issue context; please help look into this. Thank you.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7589] Add API to create HoodieStorage in HoodieIOFactory [hudi]

2024-05-13 Thread via GitHub


jonvex commented on PR #11208:
URL: https://github.com/apache/hudi/pull/11208#issuecomment-2109151251

   Screenshot: 
https://github.com/apache/hudi/assets/26940621/806a9c81-a8c6-42f0-9838-07da27cb21e2
   CI passing for commit 
https://github.com/apache/hudi/commit/153de43462c5b4ac9762cb87e4ded68640995058


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7624] Fixing index tagging duration [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #11035:
URL: https://github.com/apache/hudi/pull/11035#issuecomment-2109136722

   
   ## CI report:
   
   * e0d1d604a6331759903f4e825499f89afaac1d00 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23880)
 
   * 074845c216002fc00c28dcbb7720ffc05bdc7e8f Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23891)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7624] Fixing index tagging duration [hudi]

2024-05-13 Thread via GitHub


danny0405 commented on code in PR #11035:
URL: https://github.com/apache/hudi/pull/11035#discussion_r1599280721


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/HoodieMetrics.java:
##
@@ -207,6 +210,13 @@ public Timer.Context getIndexCtx() {
 return indexTimer == null ? null : indexTimer.time();
   }
 
+  public Timer.Context getPreWriteTimerCtx() {
+if (config.isMetricsOn() && preWriteTimer == null) {
+  preWriteTimer = createTimer(preWriteTimerName);
+}

Review Comment:
   +1 for `source_read_and_index`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7624] Fixing index tagging duration [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #11035:
URL: https://github.com/apache/hudi/pull/11035#issuecomment-2109130452

   
   ## CI report:
   
   * e0d1d604a6331759903f4e825499f89afaac1d00 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23880)
 
   * 074845c216002fc00c28dcbb7720ffc05bdc7e8f UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7589] Add API to create HoodieStorage in HoodieIOFactory [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #11208:
URL: https://github.com/apache/hudi/pull/11208#issuecomment-2109123695

   
   ## CI report:
   
   * ccdd43869f325014e3be41bc41f8c510080dc549 UNKNOWN
   * 22d30e4d7ff5da1ff2118e2d8bcb7373c4a8da88 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23885)
 
   * 153de43462c5b4ac9762cb87e4ded68640995058 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23888)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT]FileID of partition path xxx=xx does not exist. [hudi]

2024-05-13 Thread via GitHub


CaesarWangX commented on issue #11202:
URL: https://github.com/apache/hudi/issues/11202#issuecomment-2109118458

   Hi @danny0405, we don't need the metadata table, so, as I mentioned, we set 
metadata.enable=false. We are using Hudi on AWS EMR, so we don't have a chance 
to use Hudi 0.14.1.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] In hudi 0.14.0, the hoodie.properties file is modified with each micro batch. [hudi]

2024-05-13 Thread via GitHub


CaesarWangX commented on issue #11200:
URL: https://github.com/apache/hudi/issues/11200#issuecomment-2109114077

   @ad1happy2go  Thanks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



(hudi) branch master updated (6627218f71f -> c15bdb34f89)

2024-05-13 Thread jonvex
This is an automated email from the ASF dual-hosted git repository.

jonvex pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from 6627218f71f [HUDI-7750] Move HoodieLogFormatWriter class to 
hoodie-hadoop-common module (#11207)
 add c15bdb34f89 remove a few classes from hudi-common (#11209)

No new revisions were added by this update.

Summary of changes:
 .../apache/hudi/avro/HoodieBloomFilterWriteSupport.java  |  5 +++--
 .../java/org/apache/hudi/common/util/BaseFileUtils.java  |  9 -
 .../org/apache/hudi/avro/HoodieAvroWriteSupport.java | 16 +++-
 .../apache/hudi/common/util/ParquetReaderIterator.java   |  0
 .../org/apache/hudi/io/hadoop/HoodieAvroOrcWriter.java   |  3 +--
 .../org/apache/hudi/io/storage/HoodieParquetConfig.java  |  0
 .../hudi/common/util/TestParquetReaderIterator.java  |  0
 .../apache/hudi/io/hadoop/TestHoodieOrcReaderWriter.java |  2 +-
 8 files changed, 16 insertions(+), 19 deletions(-)
 rename {hudi-common => 
hudi-hadoop-common}/src/main/java/org/apache/hudi/avro/HoodieAvroWriteSupport.java
 (82%)
 rename {hudi-common => 
hudi-hadoop-common}/src/main/java/org/apache/hudi/common/util/ParquetReaderIterator.java
 (100%)
 rename {hudi-common => 
hudi-hadoop-common}/src/main/java/org/apache/hudi/io/storage/HoodieParquetConfig.java
 (100%)
 rename {hudi-common => 
hudi-hadoop-common}/src/test/java/org/apache/hudi/common/util/TestParquetReaderIterator.java
 (100%)



Re: [PR] [HUDI-7754] Remove AvroWriteSupport and ParquetReaderIterator from hudi-common [hudi]

2024-05-13 Thread via GitHub


jonvex merged PR #11209:
URL: https://github.com/apache/hudi/pull/11209


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7754] Remove AvroWriteSupport and ParquetReaderIterator from hudi-common [hudi]

2024-05-13 Thread via GitHub


jonvex commented on PR #11209:
URL: https://github.com/apache/hudi/pull/11209#issuecomment-2109092003

   Screenshot: 
https://github.com/apache/hudi/assets/26940621/9cc2d116-aae1-4226-b769-e39ec920c1c0
   CI passing
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6150] Support bucketing for each hive client [hudi]

2024-05-13 Thread via GitHub


danny0405 commented on PR #8657:
URL: https://github.com/apache/hudi/pull/8657#issuecomment-2109091522

   cc @parisni Are you still on this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7752) Abstract serializeRecords for log writing

2024-05-13 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7752:

Story Points: 2  (was: 1)

> Abstract serializeRecords for log writing
> -
>
> Key: HUDI-7752
> URL: https://issues.apache.org/jira/browse/HUDI-7752
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: hoodie-storage
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7589] Add API to create HoodieStorage in HoodieIOFactory [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #11208:
URL: https://github.com/apache/hudi/pull/11208#issuecomment-2109080914

   
   ## CI report:
   
   * 3df79138ad8473c1c5aef458a6a46ecbf1879e3d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23884)
 
   * ccdd43869f325014e3be41bc41f8c510080dc549 UNKNOWN
   * 22d30e4d7ff5da1ff2118e2d8bcb7373c4a8da88 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23885)
 
   * 153de43462c5b4ac9762cb87e4ded68640995058 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23888)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6563]Supports flink lookup join [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #9228:
URL: https://github.com/apache/hudi/pull/9228#issuecomment-2109079462

   
   ## CI report:
   
   * 55ceb8d72c2eb0e23b7763102959258101a363d1 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23872)
 
   * 8d29905fdeba6e5b81bdae7b0cdd1166511b1a1a Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23889)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7589] Add API to create HoodieStorage in HoodieIOFactory [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #11208:
URL: https://github.com/apache/hudi/pull/11208#issuecomment-2109074759

   
   ## CI report:
   
   * 3df79138ad8473c1c5aef458a6a46ecbf1879e3d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23884)
 
   * ccdd43869f325014e3be41bc41f8c510080dc549 UNKNOWN
   * 22d30e4d7ff5da1ff2118e2d8bcb7373c4a8da88 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23885)
 
   * 153de43462c5b4ac9762cb87e4ded68640995058 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6563]Supports flink lookup join [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #9228:
URL: https://github.com/apache/hudi/pull/9228#issuecomment-2109072779

   
   ## CI report:
   
   * 55ceb8d72c2eb0e23b7763102959258101a363d1 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23872)
 
   * 8d29905fdeba6e5b81bdae7b0cdd1166511b1a1a UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT]FileID of partition path xxx=xx does not exist. [hudi]

2024-05-13 Thread via GitHub


danny0405 commented on issue #11202:
URL: https://github.com/apache/hudi/issues/11202#issuecomment-2109071887

   Did you use Hudi 0.14.0 release? Did you enable the metadata table?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7589] Add API to create HoodieStorage in HoodieIOFactory [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #11208:
URL: https://github.com/apache/hudi/pull/11208#issuecomment-2109067049

   
   ## CI report:
   
   * 3df79138ad8473c1c5aef458a6a46ecbf1879e3d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23884)
 
   * ccdd43869f325014e3be41bc41f8c510080dc549 UNKNOWN
   * 22d30e4d7ff5da1ff2118e2d8bcb7373c4a8da88 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23885)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



(hudi) branch master updated (ea4f14c2851 -> 6627218f71f)

2024-05-13 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from ea4f14c2851 [HUDI-7744] Introduce IOFactory and a config to set the 
factory (#11192)
 add 6627218f71f [HUDI-7750] Move HoodieLogFormatWriter class to 
hoodie-hadoop-common module (#11207)

No new revisions were added by this update.

Summary of changes:
 .../org/apache/hudi/common/table/log/HoodieLogFormat.java |  9 -
 .../hudi/common/table/log/HoodieLogFormatWriter.java  | 15 ---
 2 files changed, 16 insertions(+), 8 deletions(-)
 rename {hudi-common => 
hudi-hadoop-common}/src/main/java/org/apache/hudi/common/table/log/HoodieLogFormatWriter.java
 (96%)



Re: [PR] [HUDI-7750] Move HoodieLogFormatWriter class to hoodie-hadoop-common module [hudi]

2024-05-13 Thread via GitHub


yihua merged PR #11207:
URL: https://github.com/apache/hudi/pull/11207


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7750] Move HoodieLogFormatWriter class to hoodie-hadoop-common module [hudi]

2024-05-13 Thread via GitHub


yihua commented on PR #11207:
URL: https://github.com/apache/hudi/pull/11207#issuecomment-2109057451

   Azure CI is green.
   Screenshot: 
https://github.com/apache/hudi/assets/2497195/784294ec-f41c-4078-819e-f183dd1e5559
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7589] Add API to create HoodieStorage in HoodieIOFactory [hudi]

2024-05-13 Thread via GitHub


jonvex commented on code in PR #11208:
URL: https://github.com/apache/hudi/pull/11208#discussion_r1599229672


##
hudi-hadoop-common/src/main/java/org/apache/hudi/storage/hadoop/HoodieHadoopStorage.java:
##
@@ -58,27 +58,10 @@
 public class HoodieHadoopStorage extends HoodieStorage {
   private final FileSystem fs;
 
-  public HoodieHadoopStorage(HoodieStorage storage) {

Review Comment:
   Yeah, I made it the getRawStorage method below



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7589] Add API to create HoodieStorage in HoodieIOFactory [hudi]

2024-05-13 Thread via GitHub


jonvex commented on code in PR #11208:
URL: https://github.com/apache/hudi/pull/11208#discussion_r1599228005


##
hudi-common/src/test/java/org/apache/hudi/common/testutils/HoodieTestUtils.java:
##
@@ -223,7 +222,8 @@ public static HoodieTableMetaClient 
createMetaClient(StorageConfiguration sto
*/
   public static HoodieTableMetaClient createMetaClient(Configuration conf,
String basePath) {
-return createMetaClient(HoodieStorageUtils.getStorageConfWithCopy(conf), 
basePath);
+return createMetaClient((StorageConfiguration) 
ReflectionUtils.loadClass(HADOOP_STORAGE_CONF,

Review Comment:
   yeah



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7754] Remove AvroWriteSupport and ParquetReaderIterator from hudi-common [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #11209:
URL: https://github.com/apache/hudi/pull/11209#issuecomment-2109025201

   
   ## CI report:
   
   * b72b023598810b9d81647fe33c1b0e7de7edf75e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23886)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7589] Add API to create HoodieStorage in HoodieIOFactory [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #11208:
URL: https://github.com/apache/hudi/pull/11208#issuecomment-2109025161

   
   ## CI report:
   
   * 3df79138ad8473c1c5aef458a6a46ecbf1879e3d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23884)
 
   * ccdd43869f325014e3be41bc41f8c510080dc549 UNKNOWN
   * 22d30e4d7ff5da1ff2118e2d8bcb7373c4a8da88 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7754] Remove AvroWriteSupport and ParquetReaderIterator from hudi-common [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #11209:
URL: https://github.com/apache/hudi/pull/11209#issuecomment-2109018806

   
   ## CI report:
   
   * b72b023598810b9d81647fe33c1b0e7de7edf75e UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7589] Add API to create HoodieStorage in HoodieIOFactory [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #11208:
URL: https://github.com/apache/hudi/pull/11208#issuecomment-2109018760

   
   ## CI report:
   
   * 3df79138ad8473c1c5aef458a6a46ecbf1879e3d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23884)
 
   * ccdd43869f325014e3be41bc41f8c510080dc549 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7589] Add API to create HoodieStorage in HoodieIOFactory [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #11208:
URL: https://github.com/apache/hudi/pull/11208#issuecomment-2109010558

   
   ## CI report:
   
   * 3df79138ad8473c1c5aef458a6a46ecbf1879e3d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23884)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7589] Add API to create HoodieStorage in HoodieIOFactory [hudi]

2024-05-13 Thread via GitHub


jonvex commented on code in PR #11208:
URL: https://github.com/apache/hudi/pull/11208#discussion_r1599213734


##
hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieIOFactory.java:
##
@@ -48,4 +48,13 @@ private static HoodieIOFactory getIOFactory(String 
ioFactoryClass) {
 
   public abstract HoodieFileWriterFactory 
getWriterFactory(HoodieRecord.HoodieRecordType recordType);
 
+  public abstract HoodieStorage getStorage(StoragePath storagePath);
+
+  public abstract HoodieStorage getStorage(StoragePath path,

Review Comment:
   Maybe we can just pass `FileSystemRetryConfig`? I am not very familiar with 
what this is.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7589] Add API to create HoodieStorage in HoodieIOFactory [hudi]

2024-05-13 Thread via GitHub


yihua commented on code in PR #11208:
URL: https://github.com/apache/hudi/pull/11208#discussion_r1599209434


##
hudi-hadoop-common/src/main/java/org/apache/hudi/io/storage/HoodieHadoopIOFactory.java:
##
@@ -19,28 +19,40 @@
 
 package org.apache.hudi.io.storage;
 
+import org.apache.hudi.common.fs.ConsistencyGuard;
 import org.apache.hudi.common.model.HoodieRecord;
 import org.apache.hudi.common.util.ReflectionUtils;
 import org.apache.hudi.exception.HoodieException;
 import org.apache.hudi.io.hadoop.HoodieAvroFileReaderFactory;
 import org.apache.hudi.io.hadoop.HoodieAvroFileWriterFactory;
+import org.apache.hudi.storage.HoodieStorage;
+import org.apache.hudi.storage.StorageConfiguration;
+import org.apache.hudi.storage.StoragePath;
+import org.apache.hudi.storage.hadoop.HoodieHadoopStorage;
 
 /**
  * Creates readers and writers for AVRO record payloads.
  * Currently uses reflection to support SPARK record payloads but
  * this ability should be removed with [HUDI-7746]
  */
 public class HoodieHadoopIOFactory extends HoodieIOFactory {
+  protected final StorageConfiguration storageConf;

Review Comment:
   Can the `storageConf` member be put into `HoodieIOFactory`?



##
hudi-hadoop-common/src/main/java/org/apache/hudi/storage/hadoop/HoodieHadoopStorage.java:
##
@@ -58,27 +58,10 @@
 public class HoodieHadoopStorage extends HoodieStorage {
   private final FileSystem fs;
 
-  public HoodieHadoopStorage(HoodieStorage storage) {

Review Comment:
   Is this moved to somewhere else?



##
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieHFileRecordReader.java:
##
@@ -59,8 +59,8 @@ public HoodieHFileRecordReader(Configuration conf, InputSplit 
split, JobConf job
 StoragePath path = convertToStoragePath(fileSplit.getPath());
 StorageConfiguration storageConf = HadoopFSUtils.getStorageConf(conf);
 HoodieConfig hoodieConfig = getReaderConfigs(storageConf);
-reader = 
HoodieIOFactory.getIOFactory(storageConf).getReaderFactory(HoodieRecord.HoodieRecordType.AVRO)
-.getFileReader(hoodieConfig, HadoopFSUtils.getStorageConf(conf), path, 
HoodieFileFormat.HFILE, Option.empty());
+reader = 
HoodieIOFactory.getIOFactory(HadoopFSUtils.getStorageConf(conf)).getReaderFactory(HoodieRecord.HoodieRecordType.AVRO)

Review Comment:
   nit: use `storageConf` directly?



##
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeRecordReaderUtils.java:
##
@@ -312,8 +312,8 @@ public static Schema addPartitionFields(Schema schema, 
List partitioning
   public static HoodieFileReader getBaseFileReader(Path path, JobConf conf) 
throws IOException {
 StorageConfiguration storageConf = HadoopFSUtils.getStorageConf(conf);
 HoodieConfig hoodieConfig = getReaderConfigs(storageConf);
-return 
HoodieIOFactory.getIOFactory(storageConf).getReaderFactory(HoodieRecord.HoodieRecordType.AVRO)
-.getFileReader(hoodieConfig, HadoopFSUtils.getStorageConf(conf), 
convertToStoragePath(path));
+return 
HoodieIOFactory.getIOFactory(HadoopFSUtils.getStorageConf(conf)).getReaderFactory(HoodieRecord.HoodieRecordType.AVRO)

Review Comment:
   Same here for using `storageConf`






Re: [PR] [HUDI-7589] Add API to create HoodieStorage in HoodieIOFactory [hudi]

2024-05-13 Thread via GitHub


jonvex commented on code in PR #11208:
URL: https://github.com/apache/hudi/pull/11208#discussion_r1599210706


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java:
##
@@ -470,9 +470,9 @@ public void performMergeDataValidationCheck(WriteStatus 
writeStatus) {
 }
 
 long oldNumWrites = 0;
-try (HoodieFileReader reader = 
HoodieIOFactory.getIOFactory(storage.getConf())
+try (HoodieFileReader reader = 
HoodieIOFactory.getIOFactory(hoodieTable.getStorageConf())

Review Comment:
   Yes. `storage.getConf()` goes through the `FileSystem`, which might use the
cached conf; this is the same issue we discussed before, where the IO factory
class config went missing.
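The caching pitfall referenced here can be illustrated with a minimal, self-contained Java sketch. These are hypothetical classes, not Hadoop's actual `FileSystem` cache; the real cache behaves analogously in that instances are cached per scheme/authority and keep the configuration they were created with:

```java
import java.util.HashMap;
import java.util.Map;

public class CachedFsSketch {
  // Hypothetical cached "filesystem": keyed by scheme, it returns the first
  // instance ever created for that key, along with that instance's snapshot
  // of the configuration.
  static final Map<String, CachedFsSketch> CACHE = new HashMap<>();
  final Map<String, String> conf;

  private CachedFsSketch(Map<String, String> conf) {
    this.conf = new HashMap<>(conf); // configuration frozen at creation time
  }

  static CachedFsSketch get(String scheme, Map<String, String> conf) {
    return CACHE.computeIfAbsent(scheme, s -> new CachedFsSketch(conf));
  }

  public static void main(String[] args) {
    Map<String, String> first = new HashMap<>();
    first.put("io.factory.class", "SomeFactory");
    CachedFsSketch fs1 = CachedFsSketch.get("hdfs", first);

    Map<String, String> second = new HashMap<>(); // note: key missing here
    CachedFsSketch fs2 = CachedFsSketch.get("hdfs", second);

    // The cached instance is returned for the second caller too, so its own
    // (here incomplete) configuration is silently ignored; conversely, a
    // caller reading conf through the cached fs may see stale settings.
    System.out.println(fs1 == fs2);                       // true
    System.out.println(fs2.conf.get("io.factory.class")); // SomeFactory
  }
}
```

This is why reading the configuration from the table (rather than through a possibly cached filesystem handle) sidesteps the missing-config problem.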






Re: [PR] [HUDI-7589] Add API to create HoodieStorage in HoodieIOFactory [hudi]

2024-05-13 Thread via GitHub


jonvex commented on code in PR #11208:
URL: https://github.com/apache/hudi/pull/11208#discussion_r1599210077


##
hudi-hadoop-common/src/main/java/org/apache/hudi/io/hadoop/HoodieAvroFileReaderFactory.java:
##
@@ -36,60 +34,48 @@
 import java.io.IOException;
 
 public class HoodieAvroFileReaderFactory extends HoodieFileReaderFactory {
-  public static final String HBASE_AVRO_HFILE_READER = 
"org.apache.hudi.io.hadoop.HoodieHBaseAvroHFileReader";
+
+  public HoodieAvroFileReaderFactory(StorageConfiguration storageConf) {
+super(storageConf);
+  }
 
   @Override
-  protected HoodieFileReader newParquetFileReader(StorageConfiguration 
conf, StoragePath path) {
-return new HoodieAvroParquetReader(conf, path);
+  protected HoodieFileReader newParquetFileReader(StoragePath path) {
+return new HoodieAvroParquetReader(storageConf, path);
   }
 
   @Override
   protected HoodieFileReader newHFileFileReader(HoodieConfig hoodieConfig,
-StorageConfiguration conf,
 StoragePath path,
 Option schemaOption) 
throws IOException {
 if (isUseNativeHFileReaderEnabled(hoodieConfig)) {
-  return new HoodieNativeAvroHFileReader(conf, path, schemaOption);
+  return new HoodieNativeAvroHFileReader(storageConf, path, schemaOption);
 }
-try {
-  if (schemaOption.isPresent()) {
-return (HoodieFileReader) 
ReflectionUtils.loadClass(HBASE_AVRO_HFILE_READER,
-new Class[] {StorageConfiguration.class, StoragePath.class, 
Option.class}, conf, path, schemaOption);
-  }
-  return (HoodieFileReader) 
ReflectionUtils.loadClass(HBASE_AVRO_HFILE_READER,
-  new Class[] {StorageConfiguration.class, StoragePath.class}, 
conf, path);
-} catch (HoodieException e) {
-  throw new IOException("Cannot instantiate HoodieHBaseAvroHFileReader", 
e);
+if (schemaOption.isPresent()) {
+  return new HoodieHBaseAvroHFileReader(storageConf, path, schemaOption);

Review Comment:
   checked, and I don't think there is anything else that we missed






Re: [PR] [HUDI-7589] Add API to create HoodieStorage in HoodieIOFactory [hudi]

2024-05-13 Thread via GitHub


yihua commented on code in PR #11208:
URL: https://github.com/apache/hudi/pull/11208#discussion_r1599206501


##
hudi-hadoop-common/src/main/java/org/apache/hudi/io/hadoop/HoodieAvroFileReaderFactory.java:
##
@@ -36,60 +34,48 @@
 import java.io.IOException;
 
 public class HoodieAvroFileReaderFactory extends HoodieFileReaderFactory {
-  public static final String HBASE_AVRO_HFILE_READER = 
"org.apache.hudi.io.hadoop.HoodieHBaseAvroHFileReader";
+
+  public HoodieAvroFileReaderFactory(StorageConfiguration storageConf) {
+super(storageConf);
+  }
 
   @Override
-  protected HoodieFileReader newParquetFileReader(StorageConfiguration 
conf, StoragePath path) {
-return new HoodieAvroParquetReader(conf, path);
+  protected HoodieFileReader newParquetFileReader(StoragePath path) {
+return new HoodieAvroParquetReader(storageConf, path);
   }
 
   @Override
   protected HoodieFileReader newHFileFileReader(HoodieConfig hoodieConfig,
-StorageConfiguration conf,
 StoragePath path,
 Option schemaOption) 
throws IOException {
 if (isUseNativeHFileReaderEnabled(hoodieConfig)) {
-  return new HoodieNativeAvroHFileReader(conf, path, schemaOption);
+  return new HoodieNativeAvroHFileReader(storageConf, path, schemaOption);
 }
-try {
-  if (schemaOption.isPresent()) {
-return (HoodieFileReader) 
ReflectionUtils.loadClass(HBASE_AVRO_HFILE_READER,
-new Class[] {StorageConfiguration.class, StoragePath.class, 
Option.class}, conf, path, schemaOption);
-  }
-  return (HoodieFileReader) 
ReflectionUtils.loadClass(HBASE_AVRO_HFILE_READER,
-  new Class[] {StorageConfiguration.class, StoragePath.class}, 
conf, path);
-} catch (HoodieException e) {
-  throw new IOException("Cannot instantiate HoodieHBaseAvroHFileReader", 
e);
+if (schemaOption.isPresent()) {
+  return new HoodieHBaseAvroHFileReader(storageConf, path, schemaOption);

Review Comment:
   Good catch!  Could you check separately on all the reflection usage on 
`HoodieStorage`, readers and writers and see if they are still needed?
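For context on the trade-off behind dropping the reflective loading shown in the diff above: reflection avoids a compile-time dependency on a class that lives in another module, at the cost of errors that only surface at runtime. A minimal, self-contained Java contrast, using `StringBuilder` purely as a hypothetical stand-in for the implementation class:

```java
import java.lang.reflect.Constructor;

public class ReflectionContrast {
  // Stand-in for a fully qualified implementation class name that is not on
  // the compile-time classpath (a JDK class is used here so the sketch runs).
  static final String IMPL_CLASS = "java.lang.StringBuilder";

  // Reflective loading: no compile-time dependency, but a missing class or a
  // mismatched constructor signature only fails at runtime.
  static CharSequence loadByReflection(String initial) throws Exception {
    Constructor<?> ctor = Class.forName(IMPL_CLASS).getConstructor(String.class);
    return (CharSequence) ctor.newInstance(initial);
  }

  // Direct construction: checked by the compiler, with clearer stack traces.
  // This becomes possible once the class moves into (or is visible to) the
  // same module, which is what the diff above takes advantage of.
  static CharSequence loadDirectly(String initial) {
    return new StringBuilder(initial);
  }

  public static void main(String[] args) throws Exception {
    System.out.println(loadByReflection("hudi")); // hudi
    System.out.println(loadDirectly("hudi"));     // hudi
  }
}
```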






Re: [PR] [HUDI-7589] Add API to create HoodieStorage in HoodieIOFactory [hudi]

2024-05-13 Thread via GitHub


yihua commented on code in PR #11208:
URL: https://github.com/apache/hudi/pull/11208#discussion_r1599192983


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkFileWriterFactory.java:
##
@@ -105,4 +109,4 @@ private static HoodieRowParquetWriteSupport 
getHoodieRowParquetWriteSupport(Stor
 StructType structType = HoodieInternalRowUtils.getCachedSchema(schema);
 return 
HoodieRowParquetWriteSupport.getHoodieRowParquetWriteSupport(conf.unwrapAs(Configuration.class),
 structType, filter, config);
   }
-}
+}

Review Comment:
   nit: keep the new line



##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkIOFactory.java:
##
@@ -20,30 +20,34 @@
 package org.apache.hudi.io.storage;
 
 import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.storage.StorageConfiguration;
 
 /**
  * Creates readers and writers for SPARK and AVRO record payloads
  */
 public class HoodieSparkIOFactory extends HoodieHadoopIOFactory {
-  private static final HoodieSparkIOFactory HOODIE_SPARK_IO_FACTORY = new 
HoodieSparkIOFactory();
 
-  public static HoodieSparkIOFactory getHoodieSparkIOFactory() {
-return HOODIE_SPARK_IO_FACTORY;
+  public HoodieSparkIOFactory(StorageConfiguration storageConf) {
+super(storageConf);
+  }
+
+  public static HoodieSparkIOFactory 
getHoodieSparkIOFactory(StorageConfiguration storageConf) {
+return new HoodieSparkIOFactory(storageConf);
   }
 
   @Override
   public HoodieFileReaderFactory 
getReaderFactory(HoodieRecord.HoodieRecordType recordType) {
 if (recordType == HoodieRecord.HoodieRecordType.SPARK) {
-  return new HoodieSparkFileReaderFactory();
+  return new HoodieSparkFileReaderFactory(storageConf);
 }
 return super.getReaderFactory(recordType);
   }
 
   @Override
   public HoodieFileWriterFactory 
getWriterFactory(HoodieRecord.HoodieRecordType recordType) {
 if (recordType == HoodieRecord.HoodieRecordType.SPARK) {
-  return new HoodieSparkFileWriterFactory();
+  return new HoodieSparkFileWriterFactory(storageConf);
 }
 return super.getWriterFactory(recordType);
   }
-}
+}

Review Comment:
   Similar here for all files.
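The `HoodieSparkIOFactory` diff quoted above replaces a static singleton factory with one constructed from a storage configuration, so every reader or writer the factory creates shares that configuration. A minimal, self-contained Java sketch of the pattern; all class names here are simplified stand-ins, not the actual Hudi APIs:

```java
import java.util.HashMap;
import java.util.Map;

public class FactorySketch {
  // Simplified stand-in for StorageConfiguration (hypothetical, not Hudi's).
  static class StorageConf {
    private final Map<String, String> props = new HashMap<>();

    // putIfAbsent mirrors the setIfUnset behavior used by the reader factories.
    void setIfUnset(String key, String value) {
      props.putIfAbsent(key, value);
    }

    String get(String key) {
      return props.get(key);
    }
  }

  // Simplified stand-in for a file reader produced by the factory.
  static class Reader {
    final StorageConf conf;
    final String path;

    Reader(StorageConf conf, String path) {
      this.conf = conf;
      this.path = path;
    }
  }

  // The factory holds the configuration as a member, so creation methods no
  // longer need a configuration parameter on every call.
  static class IOFactory {
    protected final StorageConf storageConf;

    IOFactory(StorageConf storageConf) {
      this.storageConf = storageConf;
    }

    Reader newReader(String path) {
      return new Reader(storageConf, path);
    }
  }

  public static void main(String[] args) {
    StorageConf conf = new StorageConf();
    conf.setIfUnset("spark.sql.caseSensitive", "false");
    conf.setIfUnset("spark.sql.caseSensitive", "true"); // no-op: already set
    Reader reader = new IOFactory(conf).newReader("/tmp/example.parquet");
    System.out.println(reader.conf.get("spark.sql.caseSensitive")); // false
  }
}
```

The design choice is the usual singleton-versus-parameterized trade-off: a singleton cannot carry per-table configuration, while a factory built per configuration can.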



##
hudi-common/src/test/java/org/apache/hudi/common/testutils/HoodieTestUtils.java:
##
@@ -223,7 +222,8 @@ public static HoodieTableMetaClient 
createMetaClient(StorageConfiguration sto
*/
   public static HoodieTableMetaClient createMetaClient(Configuration conf,
String basePath) {
-return createMetaClient(HoodieStorageUtils.getStorageConfWithCopy(conf), 
basePath);
+return createMetaClient((StorageConfiguration) 
ReflectionUtils.loadClass(HADOOP_STORAGE_CONF,

Review Comment:
   Should this static method be moved to `hudi-hadoop-common` and directly use 
the constructor of `HadoopStorageConfiguration`?



##
hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieFileWriterFactory.java:
##
@@ -120,4 +125,4 @@ public static boolean enableBloomFilter(boolean 
populateMetaFields, HoodieConfig
 // so the class HoodieIndexConfig cannot be accessed in hudi-common, 
otherwise there will be a circular dependency problem
 || (config.contains("hoodie.index.type") && 
config.getString("hoodie.index.type").contains("BLOOM")));
   }
-}
+}

Review Comment:
   nit: keep new line.



##
hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieIOFactory.java:
##
@@ -48,4 +48,13 @@ private static HoodieIOFactory getIOFactory(String 
ioFactoryClass) {
 
   public abstract HoodieFileWriterFactory 
getWriterFactory(HoodieRecord.HoodieRecordType recordType);
 
+  public abstract HoodieStorage getStorage(StoragePath storagePath);
+
+  public abstract HoodieStorage getStorage(StoragePath path,

Review Comment:
   How much effort is required to remove this API?






[PR] [HUDI-7754] Remove AvroWriteSupport and ParquetReaderIterator from hudi-common [hudi]

2024-05-13 Thread via GitHub


jonvex opened a new pull request, #11209:
URL: https://github.com/apache/hudi/pull/11209

   ### Change Logs
   
   These classes need to be removed from hudi-common because they have Hadoop
dependencies.
   
   ### Impact
   
   remove hadoop from hudi-common
   
   ### Risk level (write none, low medium or high below)
   low
   
   ### Documentation Update
   
   N/A
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[jira] [Updated] (HUDI-7754) Remove AvroWriteSupport and ParquetReaderIterator from hudi-common

2024-05-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7754:
-
Labels: pull-request-available  (was: )

> Remove AvroWriteSupport and ParquetReaderIterator from hudi-common
> --
>
> Key: HUDI-7754
> URL: https://issues.apache.org/jira/browse/HUDI-7754
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> Two classes with Hadoop dependencies that can be moved to hudi-hadoop-common
> and aren't covered by other PRs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7754) Remove AvroWriteSupport and ParquetReaderIterator from hudi-common

2024-05-13 Thread Jonathan Vexler (Jira)
Jonathan Vexler created HUDI-7754:
-

 Summary: Remove AvroWriteSupport and ParquetReaderIterator from 
hudi-common
 Key: HUDI-7754
 URL: https://issues.apache.org/jira/browse/HUDI-7754
 Project: Apache Hudi
  Issue Type: Task
Reporter: Jonathan Vexler
Assignee: Jonathan Vexler
 Fix For: 0.15.0, 1.0.0


Two classes with Hadoop dependencies that can be moved to hudi-hadoop-common
and aren't covered by other PRs.





Re: [PR] [HUDI-7589] Add API to create HoodieStorage in HoodieIOFactory [hudi]

2024-05-13 Thread via GitHub


yihua commented on code in PR #11208:
URL: https://github.com/apache/hudi/pull/11208#discussion_r1599190553


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkFileReaderFactory.java:
##
@@ -31,34 +31,37 @@
 
 public class HoodieSparkFileReaderFactory extends HoodieFileReaderFactory {
 
+  public HoodieSparkFileReaderFactory(StorageConfiguration storageConf) {
+super(storageConf);
+  }
+
   @Override
-  public HoodieFileReader newParquetFileReader(StorageConfiguration conf, 
StoragePath path) {
-conf.setIfUnset(SQLConf.PARQUET_BINARY_AS_STRING().key(), 
SQLConf.PARQUET_BINARY_AS_STRING().defaultValueString());
-conf.setIfUnset(SQLConf.PARQUET_INT96_AS_TIMESTAMP().key(), 
SQLConf.PARQUET_INT96_AS_TIMESTAMP().defaultValueString());
-conf.setIfUnset(SQLConf.CASE_SENSITIVE().key(), 
SQLConf.CASE_SENSITIVE().defaultValueString());
+  public HoodieFileReader newParquetFileReader(StoragePath path) {
+storageConf.setIfUnset(SQLConf.PARQUET_BINARY_AS_STRING().key(), 
SQLConf.PARQUET_BINARY_AS_STRING().defaultValueString());
+storageConf.setIfUnset(SQLConf.PARQUET_INT96_AS_TIMESTAMP().key(), 
SQLConf.PARQUET_INT96_AS_TIMESTAMP().defaultValueString());
+storageConf.setIfUnset(SQLConf.CASE_SENSITIVE().key(), 
SQLConf.CASE_SENSITIVE().defaultValueString());
 // Using string value of this conf to preserve compatibility across spark 
versions.
-conf.setIfUnset("spark.sql.legacy.parquet.nanosAsLong", "false");
+storageConf.setIfUnset("spark.sql.legacy.parquet.nanosAsLong", "false");
 // This is a required config since Spark 3.4.0: 
SQLConf.PARQUET_INFER_TIMESTAMP_NTZ_ENABLED
 // Using string value of this conf to preserve compatibility across spark 
versions.
-conf.setIfUnset("spark.sql.parquet.inferTimestampNTZ.enabled", "true");
-return new HoodieSparkParquetReader(conf, path);
+storageConf.setIfUnset("spark.sql.parquet.inferTimestampNTZ.enabled", 
"true");
+return new HoodieSparkParquetReader(storageConf, path);
   }
 
   @Override
   protected HoodieFileReader newHFileFileReader(HoodieConfig hoodieConfig,
-StorageConfiguration conf,
 StoragePath path,
 Option schemaOption) 
throws IOException {
 throw new HoodieIOException("Not support read HFile");
   }
 
   @Override
-  protected HoodieFileReader newOrcFileReader(StorageConfiguration conf, 
StoragePath path) {
+  protected HoodieFileReader newOrcFileReader(StoragePath path) {
 throw new HoodieIOException("Not support read orc file");
   }
 
   @Override
   public HoodieFileReader newBootstrapFileReader(HoodieFileReader 
skeletonFileReader, HoodieFileReader dataFileReader, Option 
partitionFields, Object[] partitionValues) {
 return new HoodieSparkBootstrapFileReader(skeletonFileReader, 
dataFileReader, partitionFields, partitionValues);
   }
-}
+}

Review Comment:
   nit: keep the new line.



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java:
##
@@ -470,9 +470,9 @@ public void performMergeDataValidationCheck(WriteStatus 
writeStatus) {
 }
 
 long oldNumWrites = 0;
-try (HoodieFileReader reader = 
HoodieIOFactory.getIOFactory(storage.getConf())
+try (HoodieFileReader reader = 
HoodieIOFactory.getIOFactory(hoodieTable.getStorageConf())

Review Comment:
   Is there a difference between `storage.getConf()` and 
`hoodieTable.getStorageConf()`?  Probably we can remove `HoodieStorage` and 
`FileSystem` instances in the `HoodieIOHandle` (in a separate PR).






Re: [PR] [HUDI-7589] Add API to create HoodieStorage in HoodieIOFactory [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #11208:
URL: https://github.com/apache/hudi/pull/11208#issuecomment-2108947219

   
   ## CI report:
   
   * 3df79138ad8473c1c5aef458a6a46ecbf1879e3d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23884)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7750] Move HoodieLogFormatWriter class to hoodie-hadoop-common module [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #11207:
URL: https://github.com/apache/hudi/pull/11207#issuecomment-2108947189

   
   ## CI report:
   
   * 7f64c7a7d1cac235655a24ca1ff5494d5afa7a86 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23883)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7549] Reverting spurious log block deduction with LogRecordReader [hudi]

2024-05-13 Thread via GitHub


yihua commented on code in PR #10922:
URL: https://github.com/apache/hudi/pull/10922#discussion_r1599188828


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java:
##
@@ -654,22 +640,6 @@ private HoodieLogBlock.HoodieLogBlockType 
pickLogDataBlockFormat() {
 }
   }
 
-  private static Map<HeaderMetadataType, String> getUpdatedHeader(Map<HeaderMetadataType, String> header, int blockSequenceNumber, long attemptNumber,
-                                                                  HoodieWriteConfig config, boolean addBlockIdentifier) {
-    Map<HeaderMetadataType, String> updatedHeader = new HashMap<>(header);
-    if (addBlockIdentifier && !HoodieTableMetadata.isMetadataTable(config.getBasePath())) { // add block sequence numbers only for data table.
-      updatedHeader.put(HeaderMetadataType.BLOCK_IDENTIFIER, attemptNumber + "," + blockSequenceNumber);
-    }
-    if (config.shouldWritePartialUpdates()) {
Review Comment:
   @jonvex Good catch! Could you fix it?






Re: [PR] [HUDI-7750] Move HoodieLogFormatWriter class to hoodie-hadoop-common module [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #11207:
URL: https://github.com/apache/hudi/pull/11207#issuecomment-2108939810

   
   ## CI report:
   
   * 7f64c7a7d1cac235655a24ca1ff5494d5afa7a86 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7549] Reverting spurious log block deduction with LogRecordReader [hudi]

2024-05-13 Thread via GitHub


jonvex commented on code in PR #10922:
URL: https://github.com/apache/hudi/pull/10922#discussion_r1599181476


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java:
##
@@ -654,22 +640,6 @@ private HoodieLogBlock.HoodieLogBlockType 
pickLogDataBlockFormat() {
 }
   }
 
-  private static Map<HeaderMetadataType, String> getUpdatedHeader(Map<HeaderMetadataType, String> header, int blockSequenceNumber, long attemptNumber,
-                                                                  HoodieWriteConfig config, boolean addBlockIdentifier) {
-    Map<HeaderMetadataType, String> updatedHeader = new HashMap<>(header);
-    if (addBlockIdentifier && !HoodieTableMetadata.isMetadataTable(config.getBasePath())) { // add block sequence numbers only for data table.
-      updatedHeader.put(HeaderMetadataType.BLOCK_IDENTIFIER, attemptNumber + "," + blockSequenceNumber);
-    }
-    if (config.shouldWritePartialUpdates()) {

Review Comment:
   @nsivabalan we need to keep this part with the partial update flag





