Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-14 Thread via GitHub


yihua merged PR #11210:
URL: https://github.com/apache/hudi/pull/11210


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-14 Thread via GitHub


hudi-bot commented on PR #11210:
URL: https://github.com/apache/hudi/pull/11210#issuecomment-2110903230

   
   ## CI report:
   
   * 4e3ae7175331848214a42f69ebadda04c5e9039e Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23922)
   
   
   Bot commands
   @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-14 Thread via GitHub


hudi-bot commented on PR #11210:
URL: https://github.com/apache/hudi/pull/11210#issuecomment-2110714142

   
   ## CI report:
   
   * fec0b450a41750d5c63f2277fc2c2786b771f405 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23906)
   * 4e3ae7175331848214a42f69ebadda04c5e9039e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23922)
   
   
   





Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-14 Thread via GitHub


hudi-bot commented on PR #11210:
URL: https://github.com/apache/hudi/pull/11210#issuecomment-2110699728

   
   ## CI report:
   
   * fec0b450a41750d5c63f2277fc2c2786b771f405 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23906)
   * 4e3ae7175331848214a42f69ebadda04c5e9039e UNKNOWN
   
   
   





Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-14 Thread via GitHub


jonvex commented on code in PR #11210:
URL: https://github.com/apache/hudi/pull/11210#discussion_r1600351544


##
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieStorageConfig.java:
##
@@ -87,6 +87,8 @@ public class HoodieStorageConfig extends HoodieConfig {
   .withDocumentation("Lower values increase the size in bytes of metadata tracked within HFile, but can offer potentially "
   + "faster lookup times.");
 
+
+

Review Comment:
   It looks like they're still there.






Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-14 Thread via GitHub


hudi-bot commented on PR #11210:
URL: https://github.com/apache/hudi/pull/11210#issuecomment-2109761333

   
   ## CI report:
   
   * fec0b450a41750d5c63f2277fc2c2786b771f405 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23906)
   
   
   





Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-14 Thread via GitHub


hudi-bot commented on PR #11210:
URL: https://github.com/apache/hudi/pull/11210#issuecomment-2109534114

   
   ## CI report:
   
   * d774603bd3bf52fe1d1c956ddca42d6804ab0fd7 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23901)
   * fec0b450a41750d5c63f2277fc2c2786b771f405 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23906)
   
   
   





Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-14 Thread via GitHub


hudi-bot commented on PR #11210:
URL: https://github.com/apache/hudi/pull/11210#issuecomment-2109519759

   
   ## CI report:
   
   * 6ab5ee102486ffe40778da29cd7eb2733e3f6b0e Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23899)
   * d774603bd3bf52fe1d1c956ddca42d6804ab0fd7 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23901)
   * fec0b450a41750d5c63f2277fc2c2786b771f405 UNKNOWN
   
   
   





Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-14 Thread via GitHub


hudi-bot commented on PR #11210:
URL: https://github.com/apache/hudi/pull/11210#issuecomment-2109507502

   
   ## CI report:
   
   * 6ab5ee102486ffe40778da29cd7eb2733e3f6b0e Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23899)
   * d774603bd3bf52fe1d1c956ddca42d6804ab0fd7 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23901)
   
   
   





Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #11210:
URL: https://github.com/apache/hudi/pull/11210#issuecomment-2109420236

   
   ## CI report:
   
   * 4442f34765c904d3995fd5047c2e8a6197525c5b Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23897)
   * 6ab5ee102486ffe40778da29cd7eb2733e3f6b0e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23899)
   * d774603bd3bf52fe1d1c956ddca42d6804ab0fd7 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23901)
   
   
   





Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #11210:
URL: https://github.com/apache/hudi/pull/11210#issuecomment-2109408820

   
   ## CI report:
   
   * 4442f34765c904d3995fd5047c2e8a6197525c5b Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23897)
   * 6ab5ee102486ffe40778da29cd7eb2733e3f6b0e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23899)
   * d774603bd3bf52fe1d1c956ddca42d6804ab0fd7 UNKNOWN
   
   
   





Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-13 Thread via GitHub


yihua commented on code in PR #11210:
URL: https://github.com/apache/hudi/pull/11210#discussion_r1599426954


##
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieStorageConfig.java:
##
@@ -87,6 +87,8 @@ public class HoodieStorageConfig extends HoodieConfig {
   .withDocumentation("Lower values increase the size in bytes of metadata tracked within HFile, but can offer potentially "
   + "faster lookup times.");
 
+
+

Review Comment:
   My bad.  Fixed.






Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #11210:
URL: https://github.com/apache/hudi/pull/11210#issuecomment-2109343812

   
   ## CI report:
   
   * 1e7ab5d044f35d65670bb0fc442721e01a677d8d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23896)
   * 4442f34765c904d3995fd5047c2e8a6197525c5b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23897)
   * 6ab5ee102486ffe40778da29cd7eb2733e3f6b0e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23899)
   
   
   





Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-13 Thread via GitHub


jonvex commented on code in PR #11210:
URL: https://github.com/apache/hudi/pull/11210#discussion_r1599415963


##
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieStorageConfig.java:
##
@@ -87,6 +87,8 @@ public class HoodieStorageConfig extends HoodieConfig {
   .withDocumentation("Lower values increase the size in bytes of metadata tracked within HFile, but can offer potentially "
   + "faster lookup times.");
 
+
+

Review Comment:
   remove extra lines






Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #11210:
URL: https://github.com/apache/hudi/pull/11210#issuecomment-2109334800

   
   ## CI report:
   
   * 1e7ab5d044f35d65670bb0fc442721e01a677d8d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23896)
   * 4442f34765c904d3995fd5047c2e8a6197525c5b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23897)
   * 6ab5ee102486ffe40778da29cd7eb2733e3f6b0e UNKNOWN
   
   
   





Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #11210:
URL: https://github.com/apache/hudi/pull/11210#issuecomment-2109290453

   
   ## CI report:
   
   * 1e7ab5d044f35d65670bb0fc442721e01a677d8d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23896)
   * 4442f34765c904d3995fd5047c2e8a6197525c5b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23897)
   
   
   





Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #11210:
URL: https://github.com/apache/hudi/pull/11210#issuecomment-2109282123

   
   ## CI report:
   
   * 1e7ab5d044f35d65670bb0fc442721e01a677d8d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23896)
   * 4442f34765c904d3995fd5047c2e8a6197525c5b UNKNOWN
   
   
   





Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #11210:
URL: https://github.com/apache/hudi/pull/11210#issuecomment-2109274982

   
   ## CI report:
   
   * 1e7ab5d044f35d65670bb0fc442721e01a677d8d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23896)
   
   
   





Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-13 Thread via GitHub


yihua commented on code in PR #11210:
URL: https://github.com/apache/hudi/pull/11210#discussion_r1599368691


##
hudi-common/src/test/java/org/apache/hudi/common/testutils/reader/HoodieFileSliceTestUtils.java:
##
@@ -207,7 +208,7 @@ private static HoodieDataBlock createDataBlock(
 false,
 header,
 HoodieRecord.RECORD_KEY_METADATA_FIELD,
-CompressionCodecName.GZIP,
+"gzip",

Review Comment:
   Replaced such occurrences with the default config value.
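   To make the change concrete: the idea is that tests read the codec name from the config property's default value instead of hardcoding `CompressionCodecName.GZIP`. A minimal self-contained sketch of that pattern (the `ConfigProperty` mini-class below is a stand-in for illustration, not Hudi's actual API):

```java
// Minimal stand-in for a config property that carries a default value.
// Hudi's real ConfigProperty differs; this only illustrates the pattern.
final class ConfigProperty<T> {
    private final String key;
    private final T defaultValue;

    ConfigProperty(String key, T defaultValue) {
        this.key = key;
        this.defaultValue = defaultValue;
    }

    String key() { return key; }
    T defaultValue() { return defaultValue; }
}

public class CodecConfigSketch {
    // Analogous to a compression-codec config such as
    // HoodieStorageConfig.PARQUET_COMPRESSION_CODEC_NAME (key assumed here).
    static final ConfigProperty<String> PARQUET_COMPRESSION_CODEC_NAME =
        new ConfigProperty<>("hoodie.parquet.compression.codec", "gzip");

    // Tests previously passed the enum constant directly; reading the
    // config default keeps test fixtures in sync with the config.
    static String codecForTests() {
        return PARQUET_COMPRESSION_CODEC_NAME.defaultValue();
    }

    public static void main(String[] args) {
        System.out.println(codecForTests()); // prints "gzip" under these assumptions
    }
}
```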






Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-13 Thread via GitHub


yihua commented on code in PR #11210:
URL: https://github.com/apache/hudi/pull/11210#discussion_r1599366211


##
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieHFileDataBlock.java:
##
@@ -74,11 +61,10 @@
  * base file format.
  */
 public class HoodieHFileDataBlock extends HoodieDataBlock {
+  public static final String HFILE_COMPRESSION_ALGO_PARAM_KEY = "hfile_compression_algo";

Review Comment:
   Fixed by using `HFILE_COMPRESSION_ALGORITHM_NAME.key()` directly. Also, I pass the String value of the config down so that it is converted directly to the corresponding `Compression.Algorithm`, as in `ParquetUtils`.
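   The string-to-algorithm conversion described here can be sketched with a stand-in enum (the real target is HBase's `Compression.Algorithm`; `valueOf` on a normalized config string is one plausible mapping, shown purely as an illustration):

```java
public class CompressionAlgoSketch {
    // Stand-in enum mirroring the shape of HBase's Compression.Algorithm.
    enum Algorithm { NONE, GZ, SNAPPY, LZ4 }

    // Resolve a codec name carried as a plain config string (e.g. "gz",
    // "SNAPPY") into the enum constant at write time, instead of passing
    // the enum itself through format-agnostic code.
    static Algorithm fromConfigValue(String name) {
        return Algorithm.valueOf(name.trim().toUpperCase());
    }

    public static void main(String[] args) {
        System.out.println(fromConfigValue("gz"));      // prints GZ
        System.out.println(fromConfigValue("SNAPPY"));  // prints SNAPPY
    }
}
```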






Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-13 Thread via GitHub


yihua commented on code in PR #11210:
URL: https://github.com/apache/hudi/pull/11210#discussion_r1599351065


##
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieParquetDataBlock.java:
##
@@ -99,29 +90,17 @@ public HoodieLogBlockType getBlockType() {
 
   @Override
   protected byte[] serializeRecords(List<HoodieRecord> records, StorageConfiguration<?> storageConf) throws IOException {
-if (records.size() == 0) {
-  return new byte[0];
-}
-
-Schema writerSchema = new Schema.Parser().parse(super.getLogBlockHeader().get(HeaderMetadataType.SCHEMA));
-ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
-HoodieConfig config = new HoodieConfig();
-config.setValue(PARQUET_COMPRESSION_CODEC_NAME.key(), compressionCodecName.get().name());
-config.setValue(PARQUET_BLOCK_SIZE.key(), String.valueOf(ParquetWriter.DEFAULT_BLOCK_SIZE));
-config.setValue(PARQUET_PAGE_SIZE.key(), String.valueOf(ParquetWriter.DEFAULT_PAGE_SIZE));
-config.setValue(PARQUET_MAX_FILE_SIZE.key(), String.valueOf(1024 * 1024 * 1024));
-config.setValue(PARQUET_COMPRESSION_RATIO_FRACTION.key(), String.valueOf(expectedCompressionRatio.get()));
-config.setValue(PARQUET_DICTIONARY_ENABLED, String.valueOf(useDictionaryEncoding.get()));
-HoodieRecordType recordType = records.iterator().next().getRecordType();
-try (HoodieFileWriter parquetWriter = HoodieFileWriterFactory.getFileWriter(
-HoodieFileFormat.PARQUET, outputStream, storageConf, config, writerSchema, recordType)) {
-  for (HoodieRecord record : records) {
-String recordKey = getRecordKey(record).orElse(null);
-parquetWriter.write(recordKey, record, writerSchema);
-  }
-  outputStream.flush();
-}
-return outputStream.toByteArray();
+Map<String, String> paramsMap = new HashMap<>();
+paramsMap.put(PARQUET_COMPRESSION_CODEC_NAME.key(), compressionCodecName.get());
+paramsMap.put(PARQUET_COMPRESSION_RATIO_FRACTION.key(), String.valueOf(expectedCompressionRatio.get()));
+paramsMap.put(PARQUET_DICTIONARY_ENABLED.key(), String.valueOf(useDictionaryEncoding.get()));
+
+return FileFormatUtils.getInstance(PARQUET).serializeRecordsToLogBlock(
+storageConf, records,
+new Schema.Parser().parse(super.getLogBlockHeader().get(HoodieLogBlock.HeaderMetadataType.SCHEMA)),

Review Comment:
   Fixed.






Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-13 Thread via GitHub


yihua commented on code in PR #11210:
URL: https://github.com/apache/hudi/pull/11210#discussion_r1599348814


##
hudi-hadoop-common/src/main/java/org/apache/hudi/common/util/ParquetUtils.java:
##
@@ -366,6 +382,35 @@ public void writeMetaFile(HoodieStorage storage,
 }
   }
 
+  @Override
+  public byte[] serializeRecordsToLogBlock(StorageConfiguration<?> storageConf,
+   List<HoodieRecord> records,
+   Schema writerSchema,
+   Schema readerSchema,
+   String keyFieldName,
+   Map<String, String> paramsMap) throws IOException {
+if (records.size() == 0) {
+  return new byte[0];
+}
+
+ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
+HoodieConfig config = new HoodieConfig();
+paramsMap.entrySet().stream().forEach(entry -> config.setValue(entry.getKey(), entry.getValue()));
+config.setValue(PARQUET_BLOCK_SIZE.key(), String.valueOf(ParquetWriter.DEFAULT_BLOCK_SIZE));
+config.setValue(PARQUET_PAGE_SIZE.key(), String.valueOf(ParquetWriter.DEFAULT_PAGE_SIZE));
+config.setValue(PARQUET_MAX_FILE_SIZE.key(), String.valueOf(1024 * 1024 * 1024));

Review Comment:
   This PR only moves the code. I've created a follow-up, HUDI-7755, to revisit these hardcoded config values. My understanding is that for log blocks, the current settings are good enough for log scanning.
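   A self-contained sketch of the layering discussed above: caller-supplied params are copied into the writer config first, then the fixed block/page/file-size values are set on top. The mini map-based config is a stand-in (config key strings are assumed for illustration); the numeric values mirror parquet-java's `ParquetWriter.DEFAULT_BLOCK_SIZE` (128 MB) and `DEFAULT_PAGE_SIZE` (1 MB) and the quoted 1 GB max file size:

```java
import java.util.HashMap;
import java.util.Map;

public class LogBlockConfigSketch {
    static Map<String, String> buildWriterConfig(Map<String, String> paramsMap) {
        // Caller params first, so format-specific settings flow through.
        Map<String, String> config = new HashMap<>(paramsMap);
        // Then the fixed values the quoted code hardcodes for log blocks
        // (HUDI-7755 tracks revisiting these).
        config.put("hoodie.parquet.block.size", String.valueOf(128 * 1024 * 1024));
        config.put("hoodie.parquet.page.size", String.valueOf(1024 * 1024));
        config.put("hoodie.parquet.max.file.size", String.valueOf(1024 * 1024 * 1024));
        return config;
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        params.put("hoodie.parquet.compression.codec", "gzip");
        System.out.println(buildWriterConfig(params));
    }
}
```

   Note that the fixed values are set after the params map, so they win even if a caller supplies those keys, which matches the quoted code's behavior.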






Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #11210:
URL: https://github.com/apache/hudi/pull/11210#issuecomment-2109237024

   
   ## CI report:
   
   * 4e922d2b74ebcf25fe8795aa15e2a99c1e082fe2 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23892)
   * 1e7ab5d044f35d65670bb0fc442721e01a677d8d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23896)
   
   
   





Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-13 Thread via GitHub


yihua commented on code in PR #11210:
URL: https://github.com/apache/hudi/pull/11210#discussion_r1599346193


##
hudi-hadoop-common/src/main/java/org/apache/hudi/common/util/HFileUtils.java:
##
@@ -128,4 +165,89 @@ public HoodieFileFormat getFormat() {
   public void writeMetaFile(HoodieStorage storage, StoragePath filePath, Properties props) throws IOException {
 throw new UnsupportedOperationException("HFileUtils does not support writeMetaFile");
   }
+
+  @Override
+  public byte[] serializeRecordsToLogBlock(StorageConfiguration<?> storageConf,
+   List<HoodieRecord> records,
+   Schema writerSchema,
+   Schema readerSchema,
+   String keyFieldName,
+   Map<String, String> paramsMap) throws IOException {
+Compression.Algorithm compressionAlgorithm = getHFileCompressionAlgorithm(paramsMap);
+HFileContext context = new HFileContextBuilder()
+.withBlockSize(DEFAULT_BLOCK_SIZE_FOR_LOG_FILE)
+.withCompression(compressionAlgorithm)
+.withCellComparator(new HoodieHBaseKVComparator())
+.build();
+
+Configuration conf = storageConf.unwrapAs(Configuration.class);
+CacheConfig cacheConfig = new CacheConfig(conf);
+ByteArrayOutputStream baos = new ByteArrayOutputStream();
+FSDataOutputStream ostream = new FSDataOutputStream(baos, null);
+
+// Use simple incrementing counter as a key
+boolean useIntegerKey = !getRecordKey(records.get(0), readerSchema, keyFieldName).isPresent();
+// This is set here to avoid re-computing this in the loop
+int keyWidth = useIntegerKey ? (int) Math.ceil(Math.log(records.size())) + 1 : -1;
+
+// Serialize records into bytes
+Map<String, List<byte[]>> sortedRecordsMap = new TreeMap<>();
+
+Iterator<HoodieRecord> itr = records.iterator();
+int id = 0;
+while (itr.hasNext()) {
+  HoodieRecord record = itr.next();
+  String recordKey;
+  if (useIntegerKey) {
+recordKey = String.format("%" + keyWidth + "s", id++);
+  } else {
+recordKey = getRecordKey(record, readerSchema, keyFieldName).get();
+  }
+
+  final byte[] recordBytes = serializeRecord(record, writerSchema, keyFieldName);
+  // If key exists in the map, append to its list. If not, create a new list.
+  // Get the existing list of recordBytes for the recordKey, or an empty list if it doesn't exist
+  List<byte[]> recordBytesList = sortedRecordsMap.getOrDefault(recordKey, new ArrayList<>());
+  recordBytesList.add(recordBytes);
+  // Put the updated list back into the map
+  sortedRecordsMap.put(recordKey, recordBytesList);
+}
+
+HFile.Writer writer = HFile.getWriterFactory(conf, cacheConfig)
+.withOutputStream(ostream).withFileContext(context).create();
+
+// Write the records
+sortedRecordsMap.forEach((recordKey, recordBytesList) -> {
+  for (byte[] recordBytes : recordBytesList) {
+try {
+  KeyValue kv = new KeyValue(recordKey.getBytes(), null, null, recordBytes);
+  writer.append(kv);
+} catch (IOException e) {
+  throw new HoodieIOException("IOException serializing records", e);
+}
+  }
+});
+
+writer.appendFileInfo(
+getUTF8Bytes(HoodieAvroHFileReaderImplBase.SCHEMA_KEY), getUTF8Bytes(readerSchema.toString()));
+
+writer.close();
+ostream.flush();
+ostream.close();
+
+return baos.toByteArray();
+  }
+
+  private Option<String> getRecordKey(HoodieRecord record, Schema readerSchema, String keyFieldName) {
+return Option.ofNullable(record.getRecordKey(readerSchema, keyFieldName));
+  }
+
+  private byte[] serializeRecord(HoodieRecord record, Schema schema, String keyFieldName) throws IOException {

Review Comment:
   Fixed.
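   For context, the quoted `serializeRecordsToLogBlock` keys records either by their record key or, when no key field is present, by a space-padded incrementing counter, and relies on `TreeMap`'s lexicographic ordering for the HFile writer. A self-contained sketch of that keying and ordering logic (the key-width calculation is simplified to the decimal digit count, and small byte payloads stand in for serialized records):

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SortedKeySketch {
    // Group payloads under fixed-width, space-padded counter keys so the
    // lexicographic TreeMap order matches the numeric insertion order.
    static List<String> paddedSortedKeys(int count) {
        int keyWidth = String.valueOf(count - 1).length();  // simplified width calc
        Map<String, List<byte[]>> sorted = new TreeMap<>();
        for (int id = 0; id < count; id++) {
            String key = String.format("%" + keyWidth + "s", id);
            sorted.computeIfAbsent(key, k -> new ArrayList<>())
                  .add(String.valueOf(id).getBytes(StandardCharsets.UTF_8));
        }
        return new ArrayList<>(sorted.keySet());
    }

    public static void main(String[] args) {
        // With width 2, key "2" becomes " 2" and sorts before "10",
        // which unpadded string keys would not do.
        System.out.println(paddedSortedKeys(12));
    }
}
```

   The fixed width matters: unpadded string keys would sort "10" before "2", breaking the writer's expectation that entries arrive in key order.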






Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-13 Thread via GitHub


yihua commented on code in PR #11210:
URL: https://github.com/apache/hudi/pull/11210#discussion_r1599346082


##
hudi-hadoop-common/src/main/java/org/apache/hudi/common/util/HFileUtils.java:
##
@@ -128,4 +165,89 @@ public HoodieFileFormat getFormat() {
   public void writeMetaFile(HoodieStorage storage, StoragePath filePath, Properties props) throws IOException {
 throw new UnsupportedOperationException("HFileUtils does not support writeMetaFile");
   }
+
+  @Override
+  public byte[] serializeRecordsToLogBlock(StorageConfiguration<?> storageConf,
+   List<HoodieRecord> records,
+   Schema writerSchema,
+   Schema readerSchema,
+   String keyFieldName,
+   Map<String, String> paramsMap) throws IOException {
+Compression.Algorithm compressionAlgorithm = getHFileCompressionAlgorithm(paramsMap);
+HFileContext context = new HFileContextBuilder()
+.withBlockSize(DEFAULT_BLOCK_SIZE_FOR_LOG_FILE)
+.withCompression(compressionAlgorithm)
+.withCellComparator(new HoodieHBaseKVComparator())
+.build();
+
+Configuration conf = storageConf.unwrapAs(Configuration.class);
+CacheConfig cacheConfig = new CacheConfig(conf);
+ByteArrayOutputStream baos = new ByteArrayOutputStream();
+FSDataOutputStream ostream = new FSDataOutputStream(baos, null);
+
+// Use simple incrementing counter as a key
+boolean useIntegerKey = !getRecordKey(records.get(0), readerSchema, keyFieldName).isPresent();
+// This is set here to avoid re-computing this in the loop
+int keyWidth = useIntegerKey ? (int) Math.ceil(Math.log(records.size())) + 1 : -1;
+
+// Serialize records into bytes
+Map<String, List<byte[]>> sortedRecordsMap = new TreeMap<>();
+
+Iterator<HoodieRecord> itr = records.iterator();
+int id = 0;
+while (itr.hasNext()) {
+  HoodieRecord record = itr.next();
+  String recordKey;
+  if (useIntegerKey) {
+recordKey = String.format("%" + keyWidth + "s", id++);
+  } else {
+recordKey = getRecordKey(record, readerSchema, keyFieldName).get();
+  }
+
+  final byte[] recordBytes = serializeRecord(record, writerSchema, keyFieldName);
+  // If key exists in the map, append to its list. If not, create a new list.
+  // Get the existing list of recordBytes for the recordKey, or an empty list if it doesn't exist
+  List<byte[]> recordBytesList = sortedRecordsMap.getOrDefault(recordKey, new ArrayList<>());
+  recordBytesList.add(recordBytes);
+  // Put the updated list back into the map
+  sortedRecordsMap.put(recordKey, recordBytesList);
+}
+
+HFile.Writer writer = HFile.getWriterFactory(conf, cacheConfig)
+.withOutputStream(ostream).withFileContext(context).create();
+
+// Write the records
+sortedRecordsMap.forEach((recordKey, recordBytesList) -> {
+  for (byte[] recordBytes : recordBytesList) {
+try {
+  KeyValue kv = new KeyValue(recordKey.getBytes(), null, null, recordBytes);
+  writer.append(kv);
+} catch (IOException e) {
+  throw new HoodieIOException("IOException serializing records", e);
+}
+  }
+});
+
+writer.appendFileInfo(
+getUTF8Bytes(HoodieAvroHFileReaderImplBase.SCHEMA_KEY), getUTF8Bytes(readerSchema.toString()));
+
+writer.close();
+ostream.flush();
+ostream.close();
+
+return baos.toByteArray();
+  }
+
+  private Option<String> getRecordKey(HoodieRecord record, Schema readerSchema, String keyFieldName) {

Review Comment:
   Fixed.






Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-13 Thread via GitHub


yihua commented on code in PR #11210:
URL: https://github.com/apache/hudi/pull/11210#discussion_r1599345810


##
hudi-hadoop-common/src/main/java/org/apache/hudi/common/util/HFileUtils.java:
##
@@ -128,4 +165,89 @@ public HoodieFileFormat getFormat() {
   public void writeMetaFile(HoodieStorage storage, StoragePath filePath, 
Properties props) throws IOException {
 throw new UnsupportedOperationException("HFileUtils does not support 
writeMetaFile");
   }
+
+  @Override
+  public byte[] serializeRecordsToLogBlock(StorageConfiguration<?> storageConf,
+   List<HoodieRecord> records,
+   Schema writerSchema,
+   Schema readerSchema,
+   String keyFieldName,
+   Map<String, String> paramsMap) 
throws IOException {
+Compression.Algorithm compressionAlgorithm = 
getHFileCompressionAlgorithm(paramsMap);
+HFileContext context = new HFileContextBuilder()
+.withBlockSize(DEFAULT_BLOCK_SIZE_FOR_LOG_FILE)
+.withCompression(compressionAlgorithm)
+.withCellComparator(new HoodieHBaseKVComparator())
+.build();
+
+Configuration conf = storageConf.unwrapAs(Configuration.class);
+CacheConfig cacheConfig = new CacheConfig(conf);
+ByteArrayOutputStream baos = new ByteArrayOutputStream();
+FSDataOutputStream ostream = new FSDataOutputStream(baos, null);
+
+// Use simple incrementing counter as a key
+boolean useIntegerKey = !getRecordKey(records.get(0), readerSchema, 
keyFieldName).isPresent();
+// This is set here to avoid re-computing this in the loop
+int keyWidth = useIntegerKey ? (int) Math.ceil(Math.log(records.size())) + 
1 : -1;
+
+// Serialize records into bytes
+Map<String, List<byte[]>> sortedRecordsMap = new TreeMap<>();
+
+Iterator<HoodieRecord> itr = records.iterator();
+int id = 0;
+while (itr.hasNext()) {
+  HoodieRecord record = itr.next();
+  String recordKey;
+  if (useIntegerKey) {
+recordKey = String.format("%" + keyWidth + "s", id++);
+  } else {
+recordKey = getRecordKey(record, readerSchema, keyFieldName).get();
+  }
+
+  final byte[] recordBytes = serializeRecord(record, writerSchema, 
keyFieldName);
+  // If key exists in the map, append to its list. If not, create a new 
list.
+  // Get the existing list of recordBytes for the recordKey, or an empty 
list if it doesn't exist
+  List<byte[]> recordBytesList = sortedRecordsMap.getOrDefault(recordKey, 
new ArrayList<>());
+  recordBytesList.add(recordBytes);
+  // Put the updated list back into the map
+  sortedRecordsMap.put(recordKey, recordBytesList);
+}
+
+HFile.Writer writer = HFile.getWriterFactory(conf, cacheConfig)
+.withOutputStream(ostream).withFileContext(context).create();
+
+// Write the records
+sortedRecordsMap.forEach((recordKey, recordBytesList) -> {
+  for (byte[] recordBytes : recordBytesList) {
+try {
+  KeyValue kv = new KeyValue(recordKey.getBytes(), null, null, 
recordBytes);
+  writer.append(kv);
+} catch (IOException e) {
+  throw new HoodieIOException("IOException serializing records", e);
+}
+  }
+});
+
+writer.appendFileInfo(
+getUTF8Bytes(HoodieAvroHFileReaderImplBase.SCHEMA_KEY), 
getUTF8Bytes(readerSchema.toString()));
+
+writer.close();
+ostream.flush();

Review Comment:
   This flushes the data to the `ByteArrayOutputStream` after the writer is done, and `writer.close()` already flushes its data internally.  This PR only moves this code from `HoodieHFileDataBlock` to the `HFileUtils` class. 
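
   The ordering question can be checked with plain stdlib streams: `close()` on a buffered writer flushes any pending bytes itself, so flushing the underlying stream after the writer is closed is harmless (this is a generic `OutputStream` sketch, not the HFile writer):

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

public class CloseFlushSketch {
    // Writes through a buffered writer and closes it WITHOUT an explicit flush first
    static int writeAndClose(byte[] payload) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        BufferedOutputStream writer = new BufferedOutputStream(baos);
        writer.write(payload);
        writer.close();   // close() flushes the internal buffer on its own
        baos.flush();     // flushing the underlying stream afterwards is a no-op here
        return baos.size();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(writeAndClose(new byte[] {1, 2, 3})); // prints 3
    }
}
```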






Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #11210:
URL: https://github.com/apache/hudi/pull/11210#issuecomment-2109230874

   
   ## CI report:
   
   * 4e922d2b74ebcf25fe8795aa15e2a99c1e082fe2 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23892)
 
   * 1e7ab5d044f35d65670bb0fc442721e01a677d8d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-13 Thread via GitHub


yihua commented on code in PR #11210:
URL: https://github.com/apache/hudi/pull/11210#discussion_r1599341130


##
hudi-hadoop-common/src/main/java/org/apache/hudi/common/util/HFileUtils.java:
##
@@ -35,21 +39,54 @@
 
 import org.apache.avro.Schema;
 import org.apache.avro.generic.GenericRecord;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.hbase.KeyValue;
+import org.apache.hadoop.hbase.io.compress.Compression;
+import org.apache.hadoop.hbase.io.hfile.CacheConfig;
+import org.apache.hadoop.hbase.io.hfile.HFile;
+import org.apache.hadoop.hbase.io.hfile.HFileContext;
+import org.apache.hadoop.hbase.io.hfile.HFileContextBuilder;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
+import java.io.ByteArrayOutputStream;
 import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Iterator;
 import java.util.List;
 import java.util.Map;
 import java.util.Properties;
 import java.util.Set;
+import java.util.TreeMap;
+
+import static 
org.apache.hudi.common.table.log.block.HoodieHFileDataBlock.HFILE_COMPRESSION_ALGO_PARAM_KEY;
+import static org.apache.hudi.common.util.StringUtils.getUTF8Bytes;
 
 /**
  * Utility functions for HFile files.
  */
-public class HFileUtils extends BaseFileUtils {
-
+public class HFileUtils extends FileFormatUtils {
   private static final Logger LOG = LoggerFactory.getLogger(HFileUtils.class);
+  private static final int DEFAULT_BLOCK_SIZE_FOR_LOG_FILE = 1024 * 1024;
+
+  /**
+   * Gets the {@link Compression.Algorithm} Enum based on the {@link 
CompressionCodec} name.
+   *
+   * @param paramsMap parameter map containing the compression codec config.
+   * @return the {@link Compression.Algorithm} Enum.
+   */
+  public static Compression.Algorithm getHFileCompressionAlgorithm(Map<String, String> paramsMap) {
+String algoName = paramsMap.get(HFILE_COMPRESSION_ALGO_PARAM_KEY);
+if (algoName == null) {

Review Comment:
   Fixed.  A new test is added.
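
   The fix discussed here adds a fallback when the codec config is absent. The defaulting pattern can be sketched with a plain map lookup (the key string, the `GZ` default, and the helper below are assumptions for illustration, not the PR's exact code):

```java
import java.util.HashMap;
import java.util.Map;

public class CompressionDefaultSketch {
    enum Algorithm { NONE, GZ, SNAPPY }

    // Hypothetical stand-in for HFILE_COMPRESSION_ALGO_PARAM_KEY
    static final String ALGO_KEY = "hfile.compression.algorithm";

    static Algorithm getAlgorithm(Map<String, String> params) {
        String name = params.get(ALGO_KEY);
        // Fall back to a default algorithm when the config is missing or empty
        if (name == null || name.isEmpty()) {
            return Algorithm.GZ;
        }
        return Algorithm.valueOf(name.toUpperCase());
    }

    public static void main(String[] args) {
        System.out.println(getAlgorithm(new HashMap<>())); // prints GZ
        Map<String, String> params = new HashMap<>();
        params.put(ALGO_KEY, "snappy");
        System.out.println(getAlgorithm(params)); // prints SNAPPY
    }
}
```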






Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-13 Thread via GitHub


yihua commented on PR #11210:
URL: https://github.com/apache/hudi/pull/11210#issuecomment-2109225533

   > also not a fan of the `org.apache.hudi.io.compress.` package name. But 
probably too late to change now
   
   Since the compression logic also falls under the scope of IO, we chose this package name.





Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #11210:
URL: https://github.com/apache/hudi/pull/11210#issuecomment-2109224269

   
   ## CI report:
   
   * 4e922d2b74ebcf25fe8795aa15e2a99c1e082fe2 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23892)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-13 Thread via GitHub


jonvex commented on code in PR #11210:
URL: https://github.com/apache/hudi/pull/11210#discussion_r1599326562


##
hudi-hadoop-common/src/main/java/org/apache/hudi/common/util/HFileUtils.java:
##
@@ -128,4 +165,89 @@ public HoodieFileFormat getFormat() {
   public void writeMetaFile(HoodieStorage storage, StoragePath filePath, 
Properties props) throws IOException {
 throw new UnsupportedOperationException("HFileUtils does not support 
writeMetaFile");
   }
+
+  @Override
+  public byte[] serializeRecordsToLogBlock(StorageConfiguration<?> storageConf,
+   List<HoodieRecord> records,
+   Schema writerSchema,
+   Schema readerSchema,
+   String keyFieldName,
+   Map<String, String> paramsMap) 
throws IOException {
+Compression.Algorithm compressionAlgorithm = 
getHFileCompressionAlgorithm(paramsMap);
+HFileContext context = new HFileContextBuilder()
+.withBlockSize(DEFAULT_BLOCK_SIZE_FOR_LOG_FILE)
+.withCompression(compressionAlgorithm)
+.withCellComparator(new HoodieHBaseKVComparator())
+.build();
+
+Configuration conf = storageConf.unwrapAs(Configuration.class);
+CacheConfig cacheConfig = new CacheConfig(conf);
+ByteArrayOutputStream baos = new ByteArrayOutputStream();
+FSDataOutputStream ostream = new FSDataOutputStream(baos, null);
+
+// Use simple incrementing counter as a key
+boolean useIntegerKey = !getRecordKey(records.get(0), readerSchema, 
keyFieldName).isPresent();
+// This is set here to avoid re-computing this in the loop
+int keyWidth = useIntegerKey ? (int) Math.ceil(Math.log(records.size())) + 
1 : -1;
+
+// Serialize records into bytes
+Map<String, List<byte[]>> sortedRecordsMap = new TreeMap<>();
+
+Iterator<HoodieRecord> itr = records.iterator();
+int id = 0;
+while (itr.hasNext()) {
+  HoodieRecord record = itr.next();
+  String recordKey;
+  if (useIntegerKey) {
+recordKey = String.format("%" + keyWidth + "s", id++);
+  } else {
+recordKey = getRecordKey(record, readerSchema, keyFieldName).get();
+  }
+
+  final byte[] recordBytes = serializeRecord(record, writerSchema, 
keyFieldName);
+  // If key exists in the map, append to its list. If not, create a new 
list.
+  // Get the existing list of recordBytes for the recordKey, or an empty 
list if it doesn't exist
+  List<byte[]> recordBytesList = sortedRecordsMap.getOrDefault(recordKey, 
new ArrayList<>());
+  recordBytesList.add(recordBytes);
+  // Put the updated list back into the map
+  sortedRecordsMap.put(recordKey, recordBytesList);
+}
+
+HFile.Writer writer = HFile.getWriterFactory(conf, cacheConfig)
+.withOutputStream(ostream).withFileContext(context).create();
+
+// Write the records
+sortedRecordsMap.forEach((recordKey, recordBytesList) -> {
+  for (byte[] recordBytes : recordBytesList) {
+try {
+  KeyValue kv = new KeyValue(recordKey.getBytes(), null, null, 
recordBytes);
+  writer.append(kv);
+} catch (IOException e) {
+  throw new HoodieIOException("IOException serializing records", e);
+}
+  }
+});
+
+writer.appendFileInfo(
+getUTF8Bytes(HoodieAvroHFileReaderImplBase.SCHEMA_KEY), 
getUTF8Bytes(readerSchema.toString()));
+
+writer.close();
+ostream.flush();

Review Comment:
   Wouldn't we want to flush before closing the writer?



##
hudi-hadoop-common/src/main/java/org/apache/hudi/common/util/HFileUtils.java:
##
@@ -35,21 +39,54 @@
 
 import org.apache.avro.Schema;
 import org.apache.avro.generic.GenericRecord;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.hbase.KeyValue;
+import org.apache.hadoop.hbase.io.compress.Compression;
+import org.apache.hadoop.hbase.io.hfile.CacheConfig;
+import org.apache.hadoop.hbase.io.hfile.HFile;
+import org.apache.hadoop.hbase.io.hfile.HFileContext;
+import org.apache.hadoop.hbase.io.hfile.HFileContextBuilder;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
+import java.io.ByteArrayOutputStream;
 import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Iterator;
 import java.util.List;
 import java.util.Map;
 import java.util.Properties;
 import java.util.Set;
+import java.util.TreeMap;
+
+import static 
org.apache.hudi.common.table.log.block.HoodieHFileDataBlock.HFILE_COMPRESSION_ALGO_PARAM_KEY;
+import static org.apache.hudi.common.util.StringUtils.getUTF8Bytes;
 
 /**
  * Utility functions for HFile files.
  */
-public class HFileUtils extends BaseFileUtils {
-
+public class HFileUtils extends FileFormatUtils {
   private static final Logger LOG = LoggerFactory.getLogger(HFileUtils.class);
+  private static final int DEFAULT_BLOCK_SIZE_FOR_LOG_FILE = 1024 * 1024;
+
+  /

Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #11210:
URL: https://github.com/apache/hudi/pull/11210#issuecomment-2109186737

   
   ## CI report:
   
   * 4e922d2b74ebcf25fe8795aa15e2a99c1e082fe2 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]

2024-05-13 Thread via GitHub


yihua opened a new pull request, #11210:
URL: https://github.com/apache/hudi/pull/11210

   ### Change Logs
   
   This PR adds a new API `serializeRecordsToLogBlock` to the `FileFormatUtils` 
class (renamed from `BaseFileUtils`), to abstract the `serializeRecords` logic 
in `HoodieParquetDataBlock` and `HoodieHFileDataBlock`.
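   
   The abstraction described above — a format-specific `serializeRecordsToLogBlock` hung off a common utils base class — can be sketched with placeholder types (the real method also takes the storage configuration, writer/reader schemas, key field name, and a params map; everything below is simplified for illustration):
   
```java
import java.nio.charset.StandardCharsets;
import java.util.List;

public class FormatUtilsSketch {
    // Simplified stand-in for the FileFormatUtils base class
    abstract static class FileFormatUtils {
        abstract byte[] serializeRecordsToLogBlock(List<String> records);
    }

    // Each file format (HFile, Parquet, ...) supplies its own serialization
    static class HFileLikeUtils extends FileFormatUtils {
        @Override
        byte[] serializeRecordsToLogBlock(List<String> records) {
            return String.join("\n", records).getBytes(StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) {
        FileFormatUtils utils = new HFileLikeUtils();
        byte[] block = utils.serializeRecordsToLogBlock(List.of("a", "b"));
        System.out.println(block.length); // prints 3 ("a\nb")
    }
}
```
   
   The log block then only holds format-agnostic logic and delegates the byte-level serialization to the utils instance for its format.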
   
   ### Impact
   
   Moves Hadoop-dependent logic of serializing Hudi records to log block 
content to the `hudi-hadoop-common` module.
   
   ### Risk level
   
   none
   
   ### Documentation Update
   
   none
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   

